Download LAMMPS here: [lammps-23Jun2022.tar.gz](https://download.lammps.org/tars/lammps-23Jun2022.tar.gz) (version used in this study)

# Introduction
... | ... | @@ -110,7 +110,6 @@ It may be interesting to do some tests to see how higher inner loop iteration co |
| `pair_style lj/charmm/coul/long 8.0 10.0` | 375 |
| `pair_style lj/charmm/coul/long 8.0 16.1` | 1290 |

### Managing different code paths
In *Overview of Algorithm and Data structures* we presented the different code paths in the function.
... | ... | @@ -120,7 +119,7 @@ With the RISC-V vector extension, this can be overcame with the help of masked i |
For instance, what proportion of the atom pair interactions (or inner loop iterations) belongs to the *do nothing* group?
Even when masked instructions are used to avoid updating *do nothing* interactions, those instructions still take time to execute.
So, as opposed to the serial version, a *do nothing* interaction costs as much time as any other interaction in the vectorized version with masked instructions.

Before work on the vectorization started, the code was modified to count the number of interactions that belong to each category.
The flowchart shows the average number of interactions (for a single `i` atom in a timestep) that belong to each category, and the arrows show the same information as percentages.
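The counting step described above can be sketched as follows. This is a minimal illustration, not the instrumentation actually added to LAMMPS: the `InteractionCounts` struct, the function names, and the two-way split by a single squared cutoff are assumptions made for the example, and only the *do nothing* category is named in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tally of inner loop iterations for one atom i.
// The real code paths are finer grained; this sketch keeps two groups.
struct InteractionCounts {
  std::size_t do_nothing = 0;  // pair is beyond the cutoff, nothing is updated
  std::size_t computed = 0;    // pair is within the cutoff, forces are updated
};

// rsq holds the squared distance to each neighbor j of one atom i,
// cutsq is the squared cutoff (assumed to be a single value here).
InteractionCounts count_interactions(const std::vector<double>& rsq,
                                     double cutsq) {
  assert(cutsq >= 0.0);
  InteractionCounts counts;
  for (double r2 : rsq) {
    if (r2 < cutsq) {
      ++counts.computed;
    } else {
      ++counts.do_nothing;
    }
  }
  return counts;
}

// Fraction of iterations that fall in the "do nothing" group.
double do_nothing_fraction(const InteractionCounts& counts) {
  const std::size_t total = counts.do_nothing + counts.computed;
  return total == 0 ? 0.0
                    : static_cast<double>(counts.do_nothing) /
                          static_cast<double>(total);
}
```

Calling `count_interactions` once per `i` atom and averaging the tallies over atoms and timesteps gives the kind of per-category averages that the flowchart reports.
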
... | ... | @@ -134,9 +133,34 @@ It may be interesting to test how the performance of the modified input affects |
LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
This can become an issue for the vectorization, specifically with indexed memory instructions in the 0.7 VPU.

To provide an example, in `i,j` interactions, the identifiers of atoms `j` are stored in `jlist`, an array of 32-bit integers.
These 32-bit integers are later used as an index for accessing atom properties such as position (`**x`), which are stored in an array of 64-bit floating point numbers.
This access cannot be vectorized using an indexed load instruction, since this instruction requires that both the result and the index vector have the same element width (SEW).

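To make the width mismatch concrete, the following scalar sketch models what an indexed load has to do for each element. It is an illustration only: `gather_x` and the flattened coordinate layout are assumptions for the example (LAMMPS stores positions as `double **x`), and no vector intrinsics are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of an indexed load (gather) of the x coordinate of each
// neighbor j. x_flat is assumed to be laid out as [x0, y0, z0, x1, ...].
std::vector<double> gather_x(const std::vector<double>& x_flat,
                             const std::vector<std::int32_t>& jlist) {
  std::vector<double> out(jlist.size());
  for (std::size_t k = 0; k < jlist.size(); ++k) {
    assert(jlist[k] >= 0);
    // The index is 32 bits wide but the data is 64 bits wide, so the
    // index has to be widened before it can address the data.
    const std::uint64_t j64 = static_cast<std::uint64_t>(jlist[k]);
    assert(3 * j64 < x_flat.size());
    out[k] = x_flat[3 * j64];
  }
  return out;
}
```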
With the 1.0 VPU, this could be solved easily with fixed-size loads that do not depend on the internal SEW setting.
A fixed-size `vle32.v` load with an internal SEW of 64 bits would be able to extend 32-bit integers to 64 bits.

In the 0.7 VPU, this type of load is not implemented.
On the other hand, the `vwadd` intrinsic allows widening from SEW to 2*SEW (from 32 to 64 bits), although it has several limitations.

First, `vwadd` is designed to work with LMUL, since it converts the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.
Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently: the input of size SEW (`__epi_2xi32`) is converted into two consecutive registers of size 2*SEW (`__epi_1xi64` each).
This is the source of a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition that uses LMUL.
For this reason, using `vwadd` with intrinsics in the 0.7 VPU is not possible: just using a `__epi_2xi64` data type triggers an error because of the lack of LMUL support.

The proposed solution was to use `vwadd` through inline assembly, but that was deemed too inconsistent.
In the VPU implementation of this instruction, the output is written to two consecutive registers, although only one is specified as an operand.
This can lead to compilation errors, since the inline assembler is not aware of this and may automatically choose a combination of input and output registers that overlap, generating an error when the instructions are assembled.
Sometimes the compiler produces this error, but changing the optimization setting (from `-O0` to `-O2`) can fix the issue, since a different combination of registers may be used that happens not to overlap.
For this reason, this approach has not been used.

In the end, a bit hack was used to extend the array of 32-bit unsigned integers into a vector register with a SEW of 64 bits.
The trick is to load the array as if it had 64-bit elements, and then use a `vand` operation to blank the most significant half of each element, mimicking a zero extension.
To get the other half, a logical right shift must be performed before applying the `vand`.
This method is very low level and depends on the endianness of the system in order to work (TODO elaborate why).
Moreover, it also requires a bit of extra handling for the case in which
To see the code in detail, check the annex (TODO).

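The trick above can be modeled in scalar code as follows. This is a sketch under stated assumptions, not the code from the annex: `widen_pairs` and `is_little_endian` are hypothetical names, the element count is assumed to be even, and `std::memcpy` stands in for the 64-bit element load. On a little-endian system the integer at the lower address ends up in the least significant half of each 64-bit word, which is why the mask yields the even-indexed elements and the shift yields the odd-indexed ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns true when the least significant byte is stored first.
bool is_little_endian() {
  const std::uint32_t probe = 1;
  unsigned char first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Widens n 32-bit unsigned integers to 64 bits by reading them two at a
// time as one 64-bit word. Assumes a little-endian system and an even n.
std::vector<std::uint64_t> widen_pairs(const std::uint32_t* src,
                                       std::size_t n) {
  assert(n % 2 == 0);
  const std::uint64_t low_mask = 0xFFFFFFFFull;
  std::vector<std::uint64_t> out(n);
  for (std::size_t k = 0; k < n; k += 2) {
    std::uint64_t word = 0;
    std::memcpy(&word, src + k, sizeof word);  // models the 64-bit load
    out[k] = word & low_mask;                  // mask: even-indexed element
    out[k + 1] = (word >> 32) & low_mask;      // shift, then mask: odd one
  }
  // In vector registers the even and odd elements would sit in two
  // separate registers; they are written back in order here for clarity.
  return out;
}
```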
`atom_vec.h` -> contains `**x` and `**f` (3D)
`neigh_list.h` -> contains `**firstneigh` (for each `i`, stores the array of neighbors `j`)