The input contains two files:

- `pair_style lj/charmm/coul/long 8.0 10.0`: this is the line that specifies that we are using the *lj/charmm/coul/long* *pair_style*. If a different *pair_style* were selected, `PairLJCharmmCoulLong::compute` would not be executed, and the optimizations would have no impact. Moreover, the `8.0` and `10.0` values are the *cutoff distances*, which can have an impact on the execution of the function. For more information, check the LAMMPS [documentation](https://docs.lammps.org/pair_charmm.html#pair-style-lj-charmm-coul-long-command).

- `data.protein`: contains the initial data for the atoms in the simulation and their properties.

|
### Loop size

When vectorizing a loop, it is important to consider the number of loop iterations.
It may not be worth vectorizing a loop with a very small iteration count, since it can lead to underuse of the vector registers and may prove slower than the serial version.

|
`PairLJCharmmCoulLong::compute` deals with interactions of pairs of atoms `i,j`.
The outer loop traverses the 32000 `i` atoms in the protein input.
For each `i` atom, the inner loop traverses the atoms `j` that are neighbors of `i`.
Moreover, each `i` atom can have a different number of neighbors (`numneigh[i]`); the sketch below illustrates this loop structure.

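A simplified sketch of the loop nest (hedged: `ilist`, `numneigh` and `firstneigh` are the neighbor-list structures from the LAMMPS source, and the body of the pair computation is elided):

```c
/* Sketch of the loop structure in PairLJCharmmCoulLong::compute. */
void compute_sketch(int inum, int *ilist, int *numneigh, int **firstneigh)
{
  for (int ii = 0; ii < inum; ii++) {    /* outer loop: ~32000 i atoms */
    int i = ilist[ii];
    int *jlist = firstneigh[i];          /* neighbors of i (32-bit ids) */
    int jnum = numneigh[i];              /* differs for every i atom */
    for (int jj = 0; jj < jnum; jj++) {  /* inner loop: our vectorization target */
      int j = jlist[jj];
      (void)j;                           /* ... evaluate the i,j interaction ... */
    }
  }
}
```
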
Our optimization targets the vectorization of the inner loop, so its iteration count will determine whether the loop is worth vectorizing.
After modifying the code to print the number of inner loop iterations, we found that the inner loop runs 375 iterations on average.
Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop can be considered suitable for vectorization iteration-count-wise.

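Since the 375 average iterations exceed the 256-element register length, the vectorized loop has to be strip-mined. A minimal scalar model of this, assuming a maximum vector length of 256 (the real code would obtain `vl` from `vsetvl`):

```c
/* Scalar model of strip-mining: process the jnum inner iterations in
   chunks of at most 256 elements, the capacity of one 0.7 vector
   register at 64-bit SEW. */
enum { MAX_VL = 256 };

void strip_mine(int jnum)
{
  for (int jj = 0; jj < jnum; ) {
    int vl = (jnum - jj < MAX_VL) ? (jnum - jj) : MAX_VL;
    /* ... process elements jj .. jj+vl-1 with vector instructions ... */
    jj += vl;
  }
}
```
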
|
|
One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.
This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.
The largest interaction distance accepted by LAMMPS produced an average of 1290 inner loop iterations.
It may be interesting to run some tests to see how higher inner loop iteration counts affect performance compared with the serial version.

|
| input line | inner loop avg. iterations |
| ---------- | -------------------------- |
| `pair_style lj/charmm/coul/long 8.0 10.0` | 375 |
| `pair_style lj/charmm/coul/long 8.0 16.1` | 1290 |

### Managing different code paths

In *Overview of Algorithm and Data structures* we presented the different code paths in the function.
It is important to carefully consider the implications of this when vectorizing.
Vectorization is based on SIMD processing (single instruction, multiple data), but different code paths require different instructions.
With the RISC-V vector extension, this can be overcome with the help of masked instructions, which allow restricting the writing of a vector instruction's result to certain elements using a bitmask, as modeled below.

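A scalar model of how a masked vector instruction behaves (a sketch, not the actual vector code):

```c
/* Scalar model of a masked vector instruction: the result is written
   only to elements whose mask bit is set; the others keep their old
   value, yet every element still occupies an execution slot. */
void masked_add(double *acc, const double *delta, const int *mask, int vl)
{
  for (int e = 0; e < vl; e++)
    if (mask[e])            /* e.g. "this i,j pair is not do-nothing" */
      acc[e] += delta[e];
}
```
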
|
For instance, what proportion of the atom pair interactions (or inner loop iterations) belongs to the *do nothing* group?
Even when using masked instructions, we can avoid updating data for *do nothing* interactions, but the execution time required for processing this data cannot be avoided.
So, as opposed to the serial version, a *do nothing* interaction has the same time cost as any other interaction in the vectorized version with masked instructions.

|
Before starting work on the vectorization, the code was modified to count the number of interactions that belong to each category.
The flowchart shows the average number of interactions (for a single `i` atom in a timestep) that belong to each category, and the arrows show the same information in percentage form.
Black values show data for the default protein input, while red values correspond to the modified input described in section *Loop size*.

|
We can see that the proportion of *do nothing* elements in the regular input is about 42%.
We deemed that extracting the non-*do nothing* elements would be too costly, since the proportion is too high and the accelerator lacks the `vcompress` [instruction](https://github.com/riscv/riscv-v-spec/blob/0.7.1/v-spec.adoc#176-vector-compress-instruction) that implements this (see [ISA support](https://repo.hca.bsc.es/gitlab/EPI/RTL/Vector_Accelerator/-/wikis/VPU/ISA-support)).
For this reason, we decided to use the masking approach, even if it makes *do nothing* elements as slow as the rest.

|
This type of "masking" approach is not suitable for the elements labeled as "slow" (the ones involving `sqrt` and `exp`), since every element would then need the computation time of "slow" and "fast" combined.
The fact that there are so few "slow" elements (around 0.3%) makes it feasible to try the "vextract" method instead.
Since that instruction is unavailable, we used a loop of `vmfirst` to locate the "slow" elements in the vector register and update them separately using the serial function `compute_iterj_special`; a scalar model of this loop follows.

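A scalar model of the `vmfirst` loop (the hypothetical helper `first_set` stands in for the instruction, which returns the index of the first set mask bit, or -1 if none are set):

```c
/* Scalar model of the vmfirst loop: find the first set bit of the
   "slow" mask, handle that element with the serial slow path, clear
   the bit, and repeat until the mask is empty. */
static int first_set(const int *mask, int vl)   /* stand-in for vmfirst */
{
  for (int e = 0; e < vl; e++)
    if (mask[e]) return e;
  return -1;
}

void handle_slow_elements(int *slow_mask, int vl)
{
  int e;
  while ((e = first_set(slow_mask, vl)) != -1) {
    /* ... update element e via the serial slow path,
       i.e. compute_iterj_special in the real code ... */
    slow_mask[e] = 0;   /* clear the bit before searching again */
  }
}
```
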
|
The modified input manages to reduce the proportion of interactions that belong to the *do nothing* and *slow* categories.
It may be interesting to test how the modified input affects performance in both the serial and vectorized versions.

|
### Managing 32-bit and 64-bit data types

LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
This can become somewhat of an issue for the vectorization, specifically with indexed memory instructions in 0.7.

To provide an example, in `i,j` interactions, the identifiers of the `j` atoms can be found in an array of 32-bit integers, `jlist`.
These 32-bit integers are later used as indices for accessing atom properties such as position (`**x`), which are stored in an array of 64-bit floating point numbers.
This access cannot be vectorized using an indexed load instruction, since that instruction requires the result and the index vector to have the same element width (SEW).

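Schematically, the scalar access that needs vectorizing looks like this (a simplified sketch using the `jlist` and `x` names from the source):

```c
/* A 32-bit neighbor id selects a 64-bit double: an indexed vector
   load cannot express this directly, since index and data would need
   the same SEW, hence the 32-to-64-bit widening discussed below. */
double gather_x(const int *jlist, int jj, double (*x)[3])
{
  int j = jlist[jj];   /* 32-bit integer index */
  return x[j][0];      /* 64-bit element being gathered */
}
```
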
|
With the 1.0 VPU, this could be solved easily with fixed-size loads that do not depend on the internal SEW setting.
A fixed-size `vle32.v` load with an internal SEW of 64 bits would be able to extend the 32-bit integers to 64 bits.

In the 0.7 VPU, this type of load is not implemented.
On the other hand, the `vwadd` instruction allows performing a widening from SEW to 2*SEW (from 32 to 64 bits), although it has several limitations.

|
First, `vwadd` is designed to work with LMUL, since it would convert the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.
Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently: the 32-bit input (`__epi_2xi32`) is widened into two consecutive registers of 64-bit elements (each an `__epi_1xi64`).
This is the source of a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition that uses LMUL.
For this reason, using `vwadd` with intrinsics in the 0.7 VPU is not possible: merely using the `__epi_2xi64` data type triggers an error because of the lack of LMUL support.

|
The proposed solution was to try to use `vwadd` with inline assembly, but that was deemed too unreliable.
In the VPU implementation of this instruction, the output is written to two consecutive registers, although only one is specified as an operand.
This can lead to compilation errors, since the inline assembler is not aware of this and may automatically choose a combination of input and output registers that overlap, generating an error when assembling the instructions.
Sometimes the compiler produces this error, but changing the optimization setting (from -O0 to -O2) can fix the issue, since a different combination of registers may be used that happens not to overlap.
For this reason, this approach has been discarded.

|
In the end, a bithack trick was used to extend the array of 32-bit unsigned integers into a vector register with a SEW width of 64 bits.
The trick is to load the 32-bit array into a register with 64-bit SEW, and then use a `vand` operation to blank the most significant half of each element, mimicking an extension with zeros.
To get the other half, a logical shift right must be performed before applying the `vand`.
In addition, this method also requires a bit of extra handling for the case in which the array has an odd number of elements.
The following figure shows a representation of the operations needed for the 32-bit to 64-bit conversion.
To see the code in detail, check annex (TODO).

![evenodd](uploads/82a21f4b6d4321ea2481d04fa9818f9e/evenodd.png)
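
A self-contained scalar sketch of the same trick (assuming a little-endian layout, with the `vand` and the logical shift modeled by `&` and `>>`):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scalar model of the even/odd extension: each 64-bit word read from
   the 32-bit array holds two consecutive elements (little-endian).
   Masking the high half (the vand) yields the even-position elements
   zero-extended to 64 bits; a logical shift right by 32 (the vsrl)
   yields the odd-position ones. */
int main(void)
{
  int32_t jlist[5] = {7, 30001, 42, 9, 12345};   /* odd element count */
  int n = 5;
  uint64_t even[3], odd[2];

  for (int k = 0; k < n / 2; k++) {
    uint64_t word;
    memcpy(&word, &jlist[2 * k], sizeof word);   /* one 64-bit load    */
    even[k] = word & 0xFFFFFFFFull;              /* vand: low 32 bits  */
    odd[k]  = word >> 32;                        /* vsrl: high 32 bits */
  }
  if (n % 2)                     /* odd-count tail handled separately */
    even[n / 2] = (uint32_t)jlist[n - 1];

  for (int k = 0; k < (n + 1) / 2; k++)
    printf("even[%d] = %llu\n", k, (unsigned long long)even[k]);
  for (int k = 0; k < n / 2; k++)
    printf("odd[%d]  = %llu\n", k, (unsigned long long)odd[k]);
  return 0;
}
```
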
It is important to check whether an unaligned memory access exception can be produced with vector loads.
For instance, to place a 32-bit array inside a 64-bit register, performing the load with a SEW of 64 produces an unaligned access exception if the starting address is not 64-bit aligned.
For this reason, the memory load needs to be performed with a SEW of 32, and the contents then moved to a 64-bit register using `vmv.v.v` (as with TODO union_int_float_t in section X).

### Handling union_int_float_t

`PairLJCharmmCoulLong::compute` defines a union type `union_int_float_t` in order to be able to operate with the bits of a 32-bit floating point number `rsq` as if it were a 32-bit integer.