djurado · 3fb12e94
--- a/32-bit-and-64-bit-data-types.md
+++ b/32-bit-and-64-bit-data-types.md
+LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
+This mix of element widths complicates vectorization, specifically with the indexed memory instructions of the 0.7 VPU.
+
+For example, in `i,j` interactions, the identifiers of atoms `i` and `j` are stored in an array of 32-bit integers, `jlist`.
+These 32-bit integers are later used as indices to access atom properties such as position (`**x`), which are stored in an array of 64-bit floating point numbers.
+This access cannot be vectorized with an indexed load instruction, since this instruction requires the result vector and the index vector to have the same element width (SEW).
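
As a point of reference, the scalar access pattern described above can be sketched as follows. This is a minimal illustration, not LAMMPS code: the function name `sum_neighbor_x` and its parameters are hypothetical, and only the `jlist` and `x` names come from the text.

```c
#include <stdint.h>

/* Hypothetical scalar version of the gather: 32-bit indices from jlist
 * select 64-bit coordinates from x. The index element (32 bits) and the
 * data element (64 bits) have different widths, which is what blocks a
 * direct mapping onto an indexed vector load with a single SEW. */
double sum_neighbor_x(double **x, const int32_t *jlist, int32_t jnum)
{
    double sum = 0.0;
    for (int32_t jj = 0; jj < jnum; ++jj) {
        int32_t j = jlist[jj]; /* 32-bit index element */
        sum += x[j][0];        /* 64-bit data element  */
    }
    return sum;
}
```

In a vectorized version of this loop, the `jlist[jj]` values would have to be widened to 64 bits before they could serve as the index vector.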
+
+With the 1.0 VPU, this could be solved easily with fixed-size loads that do not depend on the internal SEW setting.
+A fixed-size `vle32.v` load reads 32-bit elements regardless of the internal SEW, so the 32-bit integers can then be widened to 64 bits.
+
+In the 0.7 VPU, these loads are not implemented.
+On the other hand, the `vwadd` intrinsic can widen elements from SEW to 2*SEW (from 32 to 64 bits), although it has several limitations.
+
+First, `vwadd` is designed to work with LMUL, since it would convert the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.
+Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently: the input of size SEW (`__epi_2xi32`) is converted into two consecutive registers of size 2*SEW (`__epi_1xi64` each).
+This is the source of a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition that uses LMUL.
+As a result, `vwadd` cannot be used through intrinsics on the 0.7 VPU: merely using a `__epi_2xi64` data type triggers an error because of the lack of LMUL support.
+
+The proposed solution was to use `vwadd` through inline assembly, but that was deemed too unreliable.
+In the VPU implementation of this instruction, the output is written to two consecutive registers, although only one is specified as an operand.
+This can lead to compilation errors: the inline assembler is not aware of the second register and may automatically choose input and output registers that overlap, which generates an error when the instructions are assembled.
+The error appears only sometimes, and changing the optimization setting (from -O0 to -O2) can make it disappear, since a different combination of registers that happens not to overlap may be chosen.
+For this reason, this approach has been discarded.
+
+In the end, a bit-manipulation trick was used to extend the array of 32-bit unsigned integers into vector registers with a SEW of 64 bits.
+The trick is to load the 32-bit array into a register with 64-bit SEW, and then use a `vand` operation to clear the most significant half of each element, mimicking a zero extension.
+To get the other half, a logical right shift must be performed before applying the `vand`.
+In addition, this method requires some extra handling for the case in which the array has an odd number of elements.
+The following figure shows a representation of the operations needed for the 32-bit to 64-bit conversion.
+To see the code in detail, check annex (TODO).
+
+![evenodd](uploads/82a21f4b6d4321ea2481d04fa9818f9e/evenodd.png)
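
The even/odd extraction described above can be emulated in scalar C to check the logic. This is a sketch under stated assumptions, not the code from the annex: `pack_pair` and `widen_u32_to_u64` are hypothetical names, each `uint64_t` stands in for one 64-bit vector element, and the two 32-bit values are assumed to be laid out in little-endian order, as on RISC-V.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit element after loading two consecutive
 * 32-bit values with SEW=64 on a little-endian machine: the first value
 * lands in the low half, the second in the high half. */
static uint64_t pack_pair(uint32_t first, uint32_t second)
{
    return (uint64_t)first | ((uint64_t)second << 32);
}

/* Zero-extend n 32-bit values into 64-bit values using only the
 * operations named in the text: a mask (vand) and a logical right shift. */
void widen_u32_to_u64(const uint32_t *src, uint64_t *dst, size_t n)
{
    const uint64_t low_mask = 0xFFFFFFFFu;
    size_t pairs = n / 2;

    for (size_t k = 0; k < pairs; ++k) {
        uint64_t word = pack_pair(src[2 * k], src[2 * k + 1]);
        dst[2 * k]     = word & low_mask;         /* even element: mask only      */
        dst[2 * k + 1] = (word >> 32) & low_mask; /* odd element: shift, then mask */
    }

    /* Extra handling for an odd element count: the last value has no
     * partner, so it is widened on its own. */
    if (n % 2 != 0) {
        dst[n - 1] = (uint64_t)src[n - 1];
    }
}
```

Note that the even and odd results end up in separate registers in the vector version, so the way they are combined afterwards depends on how the indices are consumed; the scalar sketch simply writes them back in their original order.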
+
+It is important to check whether a vector load can produce an unaligned memory access exception.
+For instance, when placing a 32-bit array inside a 64-bit register, a load with a SEW of 64 produces an unaligned access exception if the starting address is not 8-byte aligned.
+For this reason, the memory load must be performed with a SEW of 32, and the contents then moved to a 64-bit register using `vmv.v.v` (as with TODO union_int_float_t in section X).
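
The alignment condition can be checked on the scalar side before choosing a load width. The helper below is a minimal sketch; `is_aligned` is a hypothetical name, not part of LAMMPS or the intrinsics.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when ptr is a multiple of alignment
 * bytes. A 32-bit array is typically only 4-byte aligned, so an arbitrary
 * element address may or may not satisfy the 8-byte alignment that a
 * SEW=64 load needs. */
static bool is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

For example, if an array starts on an 8-byte boundary, the address of its element 1 is offset by 4 bytes and fails the check for an alignment of 8, while the address of element 2 passes.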
\ No newline at end of file