|
|
LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
|
|
|
This can become somewhat of an issue with the vectorization, specifically with indexed memory instructions in 0.7.
|
|
|
|
|
|
To provide an example, in `i,j` interactions, the identifiers of atoms `i` `j` can be found in a array of 32-bit integers `jlist`.
|
|
|
These 32-bit integers are later used as an index for accessing atom properties such as position (`**x`), which are stored in a array of 64-bit floating point numbers.
|
|
|
This access cannot be vectorized using a load indexed instruction, since this instructions requires that both the result and the index vector have the same SEW element width.
|
|
|
|
|
|
With the 1.0 VPU, this could be solved easily with the use of fixed size loads that do not depend on the internal SEW setting.
|
|
|
A fixed size ` vle32.v` load with an internal SEW of 64-bits would be able to extend 32-bit integers to 64-bit.
|
|
|
|
|
|
In the 0.7 VPU, this type of loads are not implemented.
|
|
|
On the other hand, the `vwadd` intrinsic allows performing a widening from SEW to 2*SEW (from 32 to 64-bits), altough it has several limitations.
|
|
|
|
|
|
First, `vwadd` is designed to work with LMUL, since it would convert the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.
|
|
|
Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently, in which the input of size SEW `__epi_1xi64` is converted to two consecutive registers of size (`__epi_2xi32`).
|
|
|
This is the source of a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition that uses LMUL.
|
|
|
For this reason, using `vwadd` with intrinsics in the 0.7 VPU is not possible since just using a `__epi_2xi64` data type triggers an error because of the lack of LMUL support.
|
|
|
|
|
|
The proposed solution was to try to use `vwadd` with inline assembler, but that was deemed as too inconsistent.
|
|
|
For the implementation of this instruction in the VPU, output is written to two consecutive registers, altough only one is specified as an operand.
|
|
|
This can lead to compilation errors, since the inline assembler is not aware of this and may automatically choose a combination of input and output registers that overlap, and generate and error when assembling the instructions.
|
|
|
Sometimes the compiler produces this error, but changing the optimization setting (from -O0 to -O2) can fix the issue since a different combination of registers may be used which happen to not overlap.
|
|
|
For this reason, this approach has been discarded.
|
|
|
|
|
|
In the end, a bithack trick was used to extend the array of 32-bit unsigned integers into a vector register with SEW width of 64-bits.
|
|
|
The trick is to load the 32-bit array into a register with 64-bit SEW, and then use an `vand` operation to blank the most significant half of the elements, mimicking an extension with zeros.
|
|
|
To get the other half, it is needed to perform a shift right logic before applying the `vand`.
|
|
|
In addition, this method also requires a bit of extra handling for the case in which the array has and odd number of elements.
|
|
|
The following figure shows a representation of the operations needed for the 32-bit to 64-bit conversion.
|
|
|
To see the code in detail, check annex (TODO).
|
|
|
|
|
|
![evenodd](uploads/82a21f4b6d4321ea2481d04fa9818f9e/evenodd.png)
|
|
|
|
|
|
It is important to check if a unaligned memory access exception can be produced with vector loads.
|
|
|
For instance, to place a 32-bit array inside a 64-bit register, performing a load with a SEW of 64 produces an unaligned access exception if the starting address is not aligned.
|
|
|
For this reason, it is needed to perform the memory load with a SEW of 32, and then move the contents to a 64-bit register using `vmv.v.v` (as with TODO union_int_float_t in section X). |
|
|
\ No newline at end of file |