djurado · 068f1f07
--- a/Home.md
+++ b/Home.md
@@ -22,7 +22,7 @@ In this section we discuss the implementation of the optimized `PairLJCharmmCoul

 ### Specialization

-The end of `PairLJCharmmCoulLong::compute` contains function calls to `ev_tally` or `virial_fdotr_compute`, which are run depending on the value or `evflag` and `vflag_fdotr` variables.
+The end of `PairLJCharmmCoulLong::compute` contains function calls to `ev_tally` or `virial_fdotr_compute`. Whether these calls run depends on the values of the `evflag` and `vflag_fdotr` variables.
 An analysis using GDB breakpoints showed that, for the protein input, these funcions are only called on the first and last timesteps of the execution.
 This paraver trace shows the weight of the first and last iterations compared to the rest.

@@ -33,17 +33,18 @@ Having function calls inside the function to be optimized can be troublesome bec
 2. If not autovectorized, the function would need to be vectorized with intrinsics, with all the additional work.
 3. If kept serial, then a mechanism to unpack data from the vector registers would still be needed.

-After considering this, we decided that the specialized function should only target the case in which the functions are not called.
-Now, there are two funcions, `compute_loopi_original` and `compute_loopi_special`.
-The specialized function is called when possible, and if not the execution falls back on the original function.
+After considering this analysis, we decided to write a specialized routine that only targets the case in which the functions are not called.
+There are now two routines, `compute_loopi_original` and `compute_loopi_special`.
+The specialized routine is called when possible; otherwise, execution falls back on the original routine.

 Another factor that has been taken into account with the specialization is the fact that the protein input spcript `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
-Targeting only this case for the specialization this case for the optimization can lead to simpler code.
+Targeting only this case for the specialization can lead to simpler code, although execution falls back on the original routine when the command is used with three parameters.
+For more information about the two- and three-parameter invocations, check the LAMMPS [documentation](https://docs.lammps.org/pair_charmm.html#pair-style-lj-charmm-coul-long-command).

 ### Loop size

 When vectorizing a loop, it is important to consider the number of loop iterations.
-It may not be worth vectorizing a loop that features a very small iteration count with underusing of the vector registers, since it may prove slower than the serial version.
+It may not be worth vectorizing a loop with a very small iteration count, since it underuses the vector registers and may prove slower than the serial version.

 `PairLJCharmmCoulLong::compute` deals with interactions of pair of atoms `i,j`.
 The outer loop that traverses through the 32000 `i` atoms in the protein input.
@@ -52,7 +53,7 @@ Moreover, each `i` atom can have a different amount of neighbors (`numneigh[i]`)

 Our optimization targets the vectorization of the inner loop, so its iteration count will determine if the loop is worth vectorizing.
 After modifying the code to print the number of inner loop iterations, we found that on average, the inner loop contains 375 iterations.
-Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop is suitable for vectorizing iteration count wise.
+Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop can be considered suitable for vectorization in terms of iteration count.

 One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.
 This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.
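
The loop-size reasoning above can be checked with a little arithmetic. The following is a minimal sketch, not code from the repository: the helper names `strips_needed` and `lane_utilization` are hypothetical, and it assumes the loop is strip-mined with the full 256-element vector length and that every inner loop has the average iteration count of 375 (in practice `numneigh[i]` varies per atom).

```cpp
#include <cstddef>

// Hypothetical helper: number of vector strips needed to cover `iterations`
// loop iterations when one register holds up to `vlmax` elements.
constexpr std::size_t strips_needed(std::size_t iterations, std::size_t vlmax) {
    return (iterations + vlmax - 1) / vlmax;  // ceiling division
}

// Hypothetical helper: fraction of register slots doing useful work,
// counted over all strips of one inner loop.
constexpr double lane_utilization(std::size_t iterations, std::size_t vlmax) {
    return iterations == 0
        ? 0.0
        : static_cast<double>(iterations) /
          static_cast<double>(strips_needed(iterations, vlmax) * vlmax);
}
```

Under these assumptions, `strips_needed(375, 256)` gives 2 strips, with 375 of 512 register slots in use (about 73%). A loop with only 8 iterations would still need one full strip and use about 3% of the register, which is the underuse the text warns about.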