In other words, `int **firstneigh` is a pointer to an array of pointers to arrays.

`firstneigh[i]` contains an array of the atoms `j` that are neighbors of atom `i`.

In this protein input, there are 32,000 `i` atoms.

For example, `firstneigh[i][0]` would be the first neighbor of atom `i`.

Both `i` and `j` are atom identifiers, and are represented using a 32-bit `int`.

To retrieve the properties of an atom, this `int` value needs to be used as an index for the arrays declared in `atom_vec.h` or `atom.h`, such as `double **x` (position), `**f` (force) or `*q` (charge).

`x` and `f` are pointers to pointers, since the second index is needed to split the magnitudes into X, Y and Z components.

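To make the layout concrete, the following sketch (an illustration, not the actual LAMMPS kernel; `inum`, `ilist`, `numneigh`, `firstneigh` and `x` are the usual LAMMPS neighbor-list and atom arrays) traverses these structures:

```c++
// Simplified neighbor-list traversal; not the real force kernel.
void traverse_neighbors(int inum, int *ilist, int *numneigh,
                        int **firstneigh, double **x) {
  for (int ii = 0; ii < inum; ii++) {
    int i = ilist[ii];             // identifier of atom i
    int *jlist = firstneigh[i];    // array with the neighbors of atom i
    for (int jj = 0; jj < numneigh[i]; jj++) {
      int j = jlist[jj];           // identifier of the jj-th neighbor
      double delx = x[i][0] - x[j][0];  // [0] is X; [1] and [2] are Y and Z
      (void) delx;                 // the interaction would be computed here
    }
  }
}
```
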
Moreover, each `i` atom can have a different number of neighbors (`numneigh[i]`).

Our optimization targets the vectorization of the inner loop, so its iteration count will determine whether the loop is worth vectorizing.

After modifying the code to print the number of inner-loop iterations, we found that the inner loop runs for 375 iterations on average.

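A minimal sketch of such instrumentation, assuming the usual LAMMPS members `inum`, `ilist` and `numneigh` (the actual code we modified may differ):

```c++
#include <cstdio>

// Hypothetical instrumentation: average trip count of the inner loop.
void print_avg_inner_iterations(int inum, int *ilist, int *numneigh) {
  long long total = 0;
  for (int ii = 0; ii < inum; ii++)
    total += numneigh[ilist[ii]];  // inner loop runs numneigh[i] times
  std::printf("average inner iterations: %.1f\n", (double) total / inum);
}
```
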
Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop is suitable for vectorization in terms of iteration count.

One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.

This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.

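For instance, an invocation of the form below sets an inner and an outer cutoff of 8 and 10 distance units (illustrative values, not the ones from the protein input):

```
pair_style lj/charmm/coul/long 8.0 10.0
```

Raising these cutoffs enlarges the neighbor threshold, and with it the length of each `firstneigh[i]` array.
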
Black values show data for the default protein input, while red values correspond to the modified input.

The modified input manages to reduce the proportion of interactions that belong to the *do nothing* and *slow* categories.

It may be interesting to test how the modified input affects performance compared to the serial version.

### Managing 32-bit and 64-bit data types

LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.

This can become somewhat of an issue for vectorization, specifically with indexed memory instructions in the 0.7 vector unit.

To provide an example, in `i,j` interactions, the identifiers of the neighbor atoms `j` can be found in an array of 32-bit integers, `jlist`.

These 32-bit integers are later used as indices for accessing atom properties such as position (`**x`), which are stored in an array of 64-bit floating point numbers.

This access cannot be vectorized using an indexed load instruction, since this instruction requires that both the result vector and the index vector have the same element width (SEW).

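In scalar form, the problematic pattern is a gather whose index width differs from its data width (a minimal illustration, not the exact LAMMPS code):

```c++
// 32-bit indices select 64-bit elements: vectorizing this with an indexed
// load would require index and result vectors with the same SEW.
void gather_x(int jnum, const int *jlist, double **x, double *out) {
  for (int jj = 0; jj < jnum; jj++) {
    int j = jlist[jj];   // 32-bit index
    out[jj] = x[j][0];   // 64-bit data
  }
}
```
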
With the 1.0 VPU, this could be solved easily with the use of fixed-size loads that do not depend on the internal SEW setting.

A fixed-size `vle32.v` load with an internal SEW of 64 bits would be able to extend 32-bit integers to 64 bits.

In the 0.7 VPU, this type of load is not implemented.

On the other hand, the `vwadd` intrinsic allows performing a widening from SEW to 2*SEW (from 32 to 64 bits), although it has several limitations.

First, `vwadd` is designed to work with LMUL, since it would convert the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.

Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently: the input of size SEW (`__epi_2xi32`) is converted into two consecutive registers of size 2*SEW (`__epi_1xi64`).

This can lead to compilation errors, since the inline assembler is not aware of this use of two consecutive registers.

Sometimes the compiler produces this error, but changing the optimization setting (from -O0 to -O2) can fix the issue, since a different combination of registers may be used which happens not to overlap.

For this reason, this approach has not been used.

In the end, a bithack trick was used to extend the array of 32-bit unsigned integers into a vector register with a SEW of 64 bits.

The trick is to load the 32-bit array into a register with 64-bit SEW, and then use a `vand` operation to blank the most significant half of each element, mimicking an extension with zeros.

To get the other half, a logical shift right needs to be performed before applying the `vand`.

This method is very low level and depends on the endianness of the system in order to work (TODO elaborate why).

Moreover, it also requires a bit of extra handling for the case in which the array has an odd number of elements.

To see the code in detail, check annex (TODO).

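Until the annex is in place, here is a minimal sketch of the trick, assuming the EPI 0.7 builtin names (`__builtin_epi_vsetvl`, `__builtin_epi_vload_1xi64`, `__builtin_epi_vbroadcast_1xi64`, `__builtin_epi_vand_1xi64`, `__builtin_epi_vsrl_1xi64`); the load is written with a SEW of 64 for brevity, although, as discussed below, the real code has to load with a SEW of 32 to avoid unaligned accesses:

```c++
// Sketch of the zero-extension bithack; 'n' is assumed even here, odd
// sizes need the extra handling mentioned above.
void widen_indices(const long *jlist_as_i64, int n,
                   __epi_1xi64 *low, __epi_1xi64 *high) {
  unsigned long gvl = __builtin_epi_vsetvl(n / 2, __epi_e64, __epi_m1);

  // Each 64-bit element now packs two consecutive 32-bit values.
  __epi_1xi64 packed = __builtin_epi_vload_1xi64(jlist_as_i64, gvl);

  // One half: blank the upper 32 bits, mimicking zero extension.
  __epi_1xi64 mask = __builtin_epi_vbroadcast_1xi64(0xFFFFFFFFL, gvl);
  *low = __builtin_epi_vand_1xi64(packed, mask, gvl);

  // Other half: shift right logically; zeros fill the upper 32 bits.
  __epi_1xi64 sh = __builtin_epi_vbroadcast_1xi64(32, gvl);
  *high = __builtin_epi_vsrl_1xi64(packed, sh, gvl);
}
```
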
It is important to check whether an unaligned memory access exception can be produced with vector loads.

For instance, to place a 32-bit array inside a 64-bit register, performing the load with a SEW of 64 eventually produces an unaligned access exception, since the array is only guaranteed to be aligned to 32 bits.

For this reason, the memory load needs to be performed with a SEW of 32, and the contents then moved to a 64-bit register using `vmv.v.v` (as with `union_int_float_t` in section X; TODO).

### Handling `union_int_float_t`

`PairLJCharmmCoulLong::compute` defines a union type `union_int_float_t` in order to be able to operate with the bits of a 32-bit floating point number `rsq` as if it were a 32-bit integer.

In particular, the least significant part of the exponent combined with the most significant part of the mantissa is extracted and used as an index for an array; the bitmask and the shift value are not static, but are instead generated by LAMMPS based on the input configuration.

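For reference, the scalar pattern looks along these lines (condensed; `ncoulmask` and `ncoulshiftbits` come from `pair.cpp:init_bitmap`):

```c++
// The union exposes the bits of a float as a 32-bit integer.
typedef union { int i; float f; } union_int_float_t;

int table_index(float rsq, int ncoulmask, int ncoulshiftbits) {
  union_int_float_t rsq_lookup;
  rsq_lookup.f = rsq;                      // reinterpret the float's bits
  int itable = rsq_lookup.i & ncoulmask;   // keep exponent/mantissa field
  return itable >> ncoulshiftbits;         // align it into a table index
}
```
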
Since the `union` construct cannot be used with RISC-V intrinsic data types, an inline assembler block with a `vmv.v.v` instruction was used instead.

The following code snippet shows how the data can be successfully moved from a floating point data type to an integer data type.

```c++
__epi_1xf64 vecrsq;
__epi_1xi64 vecitable;

// vecitable <- vecrsq
__asm__ (
    "nop\n\t"
    "vmv.v.v %0, %1\n\t"
    : "=vr" (vecitable)
    : "vr" (vecrsq)
);
```

The downside of this `vmv.v.v` method is that it adds a redundant instruction: it is not actually needed to move the data from one register to another, it should be enough to just reinterpret it as a different type.

There is an intrinsic that allows converting between mask and integer vectors without performing any operation, but not between integer and floating point vectors, although the latter conversion follows the same underlying principle (conversion between data types with the same SEW).

In our optimized version, `rsq` is a 64-bit number.

After the union conversion, some bit manipulation is performed on the result.

This becomes an issue, since in the original code it was performed with 32-bit data types, while we use 64-bit data types.

To overcome this, we need to convert the bitmask and the shift value from 32 to 64 bits.

This cannot be done statically, since these values are generated in `pair.cpp:init_bitmap` based on the settings specified in the input, so a small extra piece of code is needed.

Since the extracted bit fields are centered at the start of the mantissa, in order to convert the mask and shift value to 64 bits it is enough to shift and add `(DBL_MANT_DIG - FLT_MANT_DIG)` (53 - 24 = 29 bits), which corresponds to the increase in size of the mantissa when switching from single to double precision.

Here we show the code snippet with the transformation of the mask and shift value, and a representation of the bit manipulation performed.

```c++
ncoulmask64 = ((long) ncoulmask) << (DBL_MANT_DIG - FLT_MANT_DIG);
ncoulshiftbits64 = ncoulshiftbits + (DBL_MANT_DIG - FLT_MANT_DIG);
```

![lammps_union_mask](uploads/483b72359d0e41d0b754be1518ff2383/lammps_union_mask.png)

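As a scalar illustration of how the converted values would be used (hypothetical names, mirroring the snippet above; the vectorized version applies the same operations element-wise):

```c++
// Hypothetical 64-bit counterpart of the scalar table lookup, using the
// converted mask and shift value.
long table_index64(double rsq, long ncoulmask64, int ncoulshiftbits64) {
  union { long i; double f; } rsq_lookup;
  rsq_lookup.f = rsq;                        // reinterpret the double's bits
  long itable = rsq_lookup.i & ncoulmask64;  // same field, now 29 bits higher
  return itable >> ncoulshiftbits64;         // recover the table index
}
```
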
- `atom_vec.h` -> contains `**x` and `**f` (3D)
- `neigh_list.h` -> contains `**firstneigh` (for each `i`, stores the array of neighbors `j`)
- `PairLJCharmmCoulLong::settings` ->