LAMMPS uses 64-bit `double` precision numbers for floating-point calculations and 32-bit `int` numbers for integer computations.
This can become an issue for vectorization, specifically with indexed memory instructions on the 0.7 VPU.

For example, in `i,j` interactions, the identifiers of the `j` atoms are stored in `jlist`, an array of 32-bit integers.
These 32-bit integers are later used as indices to access atom properties such as position (`**x`), which are stored in an array of 64-bit floating-point numbers.
This access to atom properties cannot be vectorized with an indexed load instruction, since that instruction requires the result vector and the index vector to have the same element width (SEW).

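The width mismatch described above can be illustrated with a scalar C model. This is only a sketch and not LAMMPS or VPU code: `widen_indices`, `gather_f64`, and the flat coordinate array `base` are hypothetical names introduced for illustration. It shows the two steps a vectorized version would need: widening the 32-bit indices to 64 bits, then an indexed load in which index and data elements share the same width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: widen the 32-bit neighbor indices to 64 bits so that index and
 * data elements have the same width (the job of a widening instruction). */
static void widen_indices(const int32_t *jlist, int64_t *idx64, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        idx64[k] = (int64_t)jlist[k];   /* sign extension, 32 -> 64 bits */
}

/* Step 2: scalar model of an indexed load (gather) in which both the index
 * vector and the result vector hold 64-bit elements. */
static void gather_f64(const double *base, const int64_t *idx64,
                       double *out, size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        assert(idx64[k] >= 0);          /* atom indices are non-negative */
        out[k] = base[idx64[k]];
    }
}
```

In scalar code the compiler performs the widening implicitly when evaluating `base[jlist[k]]`; in vector code it has to be an explicit step, which is where the difficulty arises.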
With the 1.0 VPU, this could be solved easily using fixed-size loads that do not depend on the internal SEW setting.
A fixed-size `vle32.v` load with an internal SEW of 64 bits would be able to extend 32-bit integers to 64 bits.

In the 0.7 VPU, this type of load is not implemented.
On the other hand, the `vwadd` intrinsic allows widening from SEW to 2*SEW (*e.g.* from 32 to 64 bits), although it has several limitations.

First, `vwadd` is designed to work with LMUL: it converts a register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type requires LMUL.
Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently: the SEW-wide input (`__epi_2xi32`) is widened into two consecutive registers of 64-bit elements (`__epi_1xi64` each).
This causes a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition, which uses LMUL.
For this reason, `vwadd` cannot be used through intrinsics on the 0.7 VPU: merely using the `__epi_2xi64` data type triggers an error because of the lack of LMUL support.

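To make the register-space consequence of widening concrete, the following scalar C sketch models a widening add and counts how many vector registers the inputs and the result occupy. It is an illustration only: `widening_add_i32`, `regs_needed`, and the `vlen_bits` parameter are hypothetical names, and the 128-bit register length used in the usage note is an arbitrary assumption, not the VPU's actual vector length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a widening add: each pair of SEW-bit inputs (32 bits
 * here) produces one 2*SEW-bit result (64 bits here). Adding a vector of
 * zeros turns it into a pure 32 -> 64 bit widening. */
static void widening_add_i32(const int32_t *a, const int32_t *b,
                             int64_t *out, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        out[k] = (int64_t)a[k] + (int64_t)b[k];
}

/* Number of vector registers needed to hold n elements of elem_bits each,
 * for a (hypothetical) register length of vlen_bits. */
static size_t regs_needed(size_t n, size_t elem_bits, size_t vlen_bits)
{
    assert(vlen_bits > 0);
    return (n * elem_bits + vlen_bits - 1) / vlen_bits;
}
```

With an assumed 128-bit register, four 32-bit inputs fit in one register (`regs_needed(4, 32, 128)` is 1), while the four 64-bit results need two (`regs_needed(4, 64, 128)` is 2). This is why the result must be described either with LMUL or as two consecutive registers.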
To overcome this, the proposed solution was to use `vwadd` through inline assembler, which avoids the previous conflict because inline assembler is not aware of register data types; however, this was deemed too inconsistent.
In the particular implementation of `vwadd` in the VPU, the output is written to two consecutive registers, although only the first one is specified as an operand.
This can lead to compilation errors, since inline assembler is not aware of this, and the registers chosen automatically for inputs and outputs may overlap, which generates an error when the instructions are assembled.
The error appears only sometimes, and changing the optimization setting (e.g. from `-O0` to `-O2`) can make it disappear, since a different combination of registers may be chosen that happens not to overlap.
For this reason, this approach has been discarded.
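
The overlap condition behind these assembler errors can be written down as a small C predicate. This is a sketch under the assumption, taken from the description above, that the widening instruction writes the register pair `{vd, vd+1}` while only `vd` appears as an operand; `widening_dest_overlaps` is a hypothetical helper, not part of any toolchain.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the destination pair {vd, vd+1} implicitly written by the
 * widening instruction overlaps the source register vs. Register numbers
 * refer to the 32 vector registers v0..v31. */
static bool widening_dest_overlaps(unsigned vd, unsigned vs)
{
    assert(vd < 31 && vs < 32);   /* vd + 1 must also be a valid register */
    return vs == vd || vs == vd + 1;
}
```

For example, an allocation with `vd = 2` and `vs = 3` overlaps, while `vd = 2` and `vs = 4` does not. Because the compiler only sees `vd`, whether a given build hits the overlapping case depends on the register allocation it happens to choose.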