djurado · 276aa45c
--- a/Home.md
+++ b/Home.md
@@ -33,7 +33,7 @@ After considering this analysis, we decided on writing an specialized routine th
 Now, there are two routines, `compute_loopi_original` and `compute_loopi_special`.
 The specialized routine is called when possible; otherwise, execution falls back on the original function.

-Another factor that has been taken into account with the specialization is the fact that the protein input spcript `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
+The last factor taken into account in the specialization is that the protein input script `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
 Targeting only this case for the specialization can lead to simpler code, although it would fall back on the original function when used with three parameters.
 For more information about the two- and three-parameter invocations, check the LAMMPS [documentation](https://docs.lammps.org/pair_charmm.html#pair-style-lj-charmm-coul-long-command).
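
The fallback rule described above can be sketched as follows. This is a minimal illustration, not the actual LAMMPS code: only the routine names and the `cut_ljsq` / `cut_coulsq` variables come from the text, while `Routine`, `Cutoffs`, `from_two_args`, `from_three_args` and `select_routine` are hypothetical helpers. It also assumes that the cutoff equality is the only condition checked, which may not match the real dispatch.

```cpp
// Hypothetical stand-ins: only the two routine names come from the text.
enum class Routine { Original, Special };

// Squared cutoffs, named after the variables mentioned in the text.
struct Cutoffs {
    double cut_ljsq;
    double cut_coulsq;
};

// Two-parameter form: the Coulomb cutoff reuses the outer LJ cutoff,
// so both squared cutoffs are equal by construction.
Cutoffs from_two_args(double /*inner*/, double outer) {
    return {outer * outer, outer * outer};
}

// Three-parameter form: the Coulomb cutoff is given separately.
Cutoffs from_three_args(double /*inner*/, double outer, double coul) {
    return {outer * outer, coul * coul};
}

// Assumption: the specialized routine is selected only when both squared
// cutoffs coincide; anything else falls back on the original routine.
Routine select_routine(const Cutoffs& c) {
    return (c.cut_ljsq == c.cut_coulsq) ? Routine::Special : Routine::Original;
}
```

With illustrative values, `select_routine(from_two_args(8.0, 10.0))` picks the specialized routine, while `select_routine(from_three_args(8.0, 10.0, 12.0))` falls back on the original one. Note that this sketch compares cutoff values, so a three-parameter call whose Coulomb cutoff equals the outer LJ cutoff would also take the specialized path here.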

@@ -69,15 +69,23 @@ Vectorization is based on SIMD processing (single instruction, multiple data), b
 With the RISC-V vector extension, this can be overcome with the help of masked instructions, which allow restricting the write of a vector instruction's result to only certain elements using a bitmask.

 For instance, which proportion of the atom pair interactions (or inner loop iterations) belong to the *do nothing* group?
-Even when using masked instructions to avoid updating *do nothing* interactions, instructions take some time to execute.
+With masked instructions we can avoid updating data for *do nothing* interactions, but we cannot avoid the execution time spent processing them.
 So, as opposed to the serial version, a *do nothing* interaction has the same cost in time as any other interaction in the vectorized version with masked instructions.

 Before starting work on the vectorization, the code was modified to count the number of interactions that belong to each category.
 The flowchart shows the average number of interactions (for a single `i` atom in a timestep) that belong to each category, and the arrows show the same information in percentage form.
 Black values show data for the default protein input, while red values correspond to the modified input described in section *Loop size*.

+The proportion of "do nothing" elements in the regular input is about 42%.
+We deemed that extracting the elements that are not "do nothing" would be too costly, since the proportion is too high and the accelerator lacks the `vcompress` [instruction](https://github.com/riscv/riscv-v-spec/blob/0.7.1/v-spec.adoc#176-vector-compress-instruction) that implements this (see [ISA support](https://repo.hca.bsc.es/gitlab/EPI/RTL/Vector_Accelerator/-/wikis/VPU/ISA-support)).
+For this reason, we decided to use the masking approach, even if it makes "do nothing" elements as slow as the rest.
+
+This type of "masking" approach is not suitable for the elements labeled as "slow" (the ones involving `sqrt` and `exp`), since all elements would then need the combined computation time of "slow" and "fast".
+Because there are so few "slow" elements (around 0.3%), it is feasible to try the "vextract" method for them.
+Since the instruction is unavailable, we used a loop of `
 The modified input manages to reduce the proportion of interactions that belong to the *do nothing* and *slow* categories.
-It may be interesting to test how the performance of the modified input affects performance compared to the serial version.
+It may be interesting to test how the modified input affects performance compared to the serial version.
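
As a rough illustration of the trade-off between masking and extracting elements discussed above, the following scalar C++ sketch emulates both strategies and counts how many vector operations each would issue. It is not the code used in this work and does not use RISC-V vector intrinsics: `VLEN`, `masked_update`, `compacted_update` and the `in[k] * 2.0` placeholder computation are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical vector length, used only for this illustration.
constexpr std::size_t VLEN = 8;

// Masked approach: every element occupies a lane, and the mask only
// suppresses the write-back. Returns the number of vector operations issued.
std::size_t masked_update(const std::vector<bool>& active,
                          const std::vector<double>& in,
                          std::vector<double>& out) {
    std::size_t vector_ops = 0;
    for (std::size_t base = 0; base < in.size(); base += VLEN) {
        ++vector_ops;  // one vector instruction covers the whole chunk
        for (std::size_t k = base; k < base + VLEN && k < in.size(); ++k) {
            const double r = in[k] * 2.0;  // computed for every lane
            if (active[k]) out[k] = r;     // written only where the mask is set
        }
    }
    return vector_ops;
}

// Extraction approach: a scalar loop first gathers the indices of the
// active elements, so the vector work only covers those elements.
std::size_t compacted_update(const std::vector<bool>& active,
                             const std::vector<double>& in,
                             std::vector<double>& out) {
    std::vector<std::size_t> idx;
    for (std::size_t k = 0; k < in.size(); ++k) {
        if (active[k]) idx.push_back(k);  // extra gathering cost paid up front
    }
    std::size_t vector_ops = 0;
    for (std::size_t base = 0; base < idx.size(); base += VLEN) {
        ++vector_ops;
        for (std::size_t j = base; j < base + VLEN && j < idx.size(); ++j) {
            out[idx[j]] = in[idx[j]] * 2.0;
        }
    }
    return vector_ops;
}
```

Both functions produce the same output. The masked version issues one vector operation per chunk regardless of how many lanes are active, whereas the extraction version issues fewer vector operations but pays for the gathering loop, which only becomes attractive when very few elements need the work.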
+

 ### Managing 32-bit and 64 data types