djurado · b1e355f5
--- a/Home.md
+++ b/Home.md
-download here LAMMPS [lammps-23Jun2022.tar.gz](https://download.lammps.org/tars/lammps-23Jun2022.tar.gz (version used in this study)
+Download LAMMPS here: [lammps-23Jun2022.tar.gz](https://download.lammps.org/tars/lammps-23Jun2022.tar.gz) (version used in this study)

 # Introduction

@@ -110,7 +110,6 @@ It may be interesting to do some tests to see how higher inner loop iteration co
 | `pair_style lj/charmm/coul/long 8.0 10.0` | 375 |
 | `pair_style lj/charmm/coul/long 8.0 16.1` | 1290|

-
 ### Managing different code paths

 In *Overview of Algorithm and Data structures* we presented the different code paths in the function.
@@ -120,7 +119,7 @@ With the RISC-V vector extension, this can be overcame with the help of masked i

 For instance, which proportion of the atom pair interactions (or inner loop iterations) belong to the *do nothing* group?
 Even when using masked instructions to avoid updating *do nothing* interactions, instructions take some time to execute.
-So, as opposed to the serial version, a *do nothing* interactions has the same cost in time as any other atom in the vectorized version with masked instructions.
+So, as opposed to the serial version, a *do nothing* interaction has the same cost in time as any other interaction in the vectorized version with masked instructions.

 Before starting working on the vectorization, the code was modified to count the number of interactions that belong to each category.
 The flowchart shows the average number number of interactions (for a single `i` atom in a timestep) that belong to each category, and the arrows show the same information in percentage form.
@@ -134,9 +133,34 @@ It may be interesting to test how the performance of the modified input affects
 LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
 This can become somewhat of an issue with the vectorization, specifically with indexed memory instructions in 0.7.

-To provide an example, in `i,j` interactions, the identifiers of atoms `j` are placed in an array of 32 bit integers `jlist`.
+To provide an example, in `i,j` interactions, the identifiers of the `j` atoms can be found in an array of 32 bit integers, `jlist`.
 These 32 bit integers are later used as an index for accessing atom properties such as position (`**x`), which are stored in a array of 64 bit floating point numbers.
-This access cannot be vectorized
+This access cannot be vectorized using an indexed load instruction, since this instruction requires that both the result and the index vector have the same element width (SEW).
+
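The width mismatch can be illustrated with a scalar model of the gather. This is only a sketch: the function name `gather_x` and the flat layout of three `double` coordinates per atom are assumptions made for the example, not the LAMMPS data structures or the EPI intrinsics. It shows that each 32 bit identifier has to be widened to 64 bits before it can serve as an offset into the 64 bit position data, which is the step an indexed vector load with a single SEW cannot do by itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of gathering the x coordinate of each neighbor j.
 * Assumption for this sketch: positions are stored flat, 3 doubles per atom. */
void gather_x(const double *x_flat, const int32_t *jlist, size_t n,
              double *out_x)
{
    for (size_t k = 0; k < n; ++k) {
        /* Widen the 32 bit identifier to 64 bits, then turn it into a
         * byte offset, as an indexed load would need it. */
        uint64_t index64 = (uint64_t)(uint32_t)jlist[k];
        uint64_t byte_offset = index64 * 3u * sizeof(double);

        const double *atom = (const double *)((const char *)x_flat + byte_offset);
        out_x[k] = atom[0]; /* x component of atom j */
    }
}
```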
+With the 1.0 VPU, this could be solved easily with the use of fixed size loads that do not depend on the internal SEW setting.
+A fixed size `vle32.v` load with an internal SEW of 64 bits would be able to extend 32 bit integers to 64 bit.
+
+In the 0.7 VPU, this type of load is not implemented.
+On the other hand, the `vwadd` intrinsic allows performing a widening from SEW to 2*SEW (from 32 to 64 bits), although it has several limitations.
+
+First, `vwadd` is designed to work with LMUL, since it would convert the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.
+Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently: the input of size SEW (`__epi_1xi64`) is converted to two consecutive registers (`__epi_2xi32`).
+This is the source of a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition that uses LMUL.
+For this reason, using `vwadd` with intrinsics in the 0.7 VPU is not possible since just using a `__epi_2xi64` data type triggers an error because of the lack of LMUL support.
+
+The proposed solution was to use `vwadd` with inline assembler, but that was deemed too unreliable.
+In the VPU implementation of this instruction, the output is written to two consecutive registers, although only one is specified as an operand.
+This can lead to compilation errors, since the inline assembler is not aware of this and may automatically choose a combination of input and output registers that overlap, generating an error when assembling the instructions.
+The error does not always appear: changing the optimization setting (from `-O0` to `-O2`) can make it go away, since a different combination of registers that happens not to overlap may be used.
+For this reason, this approach has not been used.
+
+In the end, a bit-hack was used to extend the array of 32 bit unsigned integers into a vector register with a SEW of 64 bits.
+The trick is to load the array as if it had 64 bit elements, and then use a `vand` operation to blank the most significant half of each element, mimicking an extension with zeros.
+To get the other half, a logical right shift must be performed before applying the `vand`.
+This method is very low level and depends on the endianness of the system in order to work (TODO elaborate why).
+Moreover, it also requires a bit of extra handling for the case in which 
+To see the code in detail, check annex (TODO).
+
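The load, mask and shift steps described above can be sketched in scalar C. This is a hedged illustration, not the code used in the study: `widen_u32_pairs` is a hypothetical name, the loop stands in for the vector instructions, and the sketch assumes an even number of 32 bit elements and a little-endian machine, where the lower-addressed element of each pair lands in the low half of the 64 bit word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the widening trick (assumes little-endian byte order).
 * For each pair of 32 bit indices:
 *   even_out[k] = idx[2*k]      (low half, obtained with an AND mask)
 *   odd_out[k]  = idx[2*k + 1]  (high half, obtained with a logical shift) */
void widen_u32_pairs(const uint32_t *idx, size_t n_pairs,
                     uint64_t *even_out, uint64_t *odd_out)
{
    const uint64_t low_mask = UINT64_C(0xFFFFFFFF);

    for (size_t k = 0; k < n_pairs; ++k) {
        uint64_t packed;
        /* Load two 32 bit elements as if they were one 64 bit element. */
        memcpy(&packed, &idx[2 * k], sizeof packed);

        /* Blank the most significant half: zero extension of idx[2*k]. */
        even_out[k] = packed & low_mask;

        /* Shift right logical, then mask. The mask is redundant after a
         * logical shift and is kept only to mirror the description. */
        odd_out[k] = (packed >> 32) & low_mask;
    }
}
```

Calling it on `{1, 2, 0xFFFFFFFF, 7}` with `n_pairs = 2` yields the even elements `{1, 0xFFFFFFFF}` and the odd elements `{2, 7}`, each zero-extended to 64 bits. On a big-endian machine the two halves would come out swapped, which is why the approach depends on the byte order.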

 atom_vec.h -> contains `**x` and `**f` (3D)
 neigh_list.h -> contains `**firstneigh` (for each i, store array of neighbors j)