... | ... | @@ -33,7 +33,7 @@ After considering this analysis, we decided on writing an specialized routine th |
|
|
Now, there are two routines, `compute_loopi_original` and `compute_loopi_special`.
|
|
|
The specialized routine is called when possible, if not, execution falls back on the original function.
|
|
|
|
|
|
Another factor that has been taken into account with the specialization is the fact that the protein input spcript `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
|
|
|
The last factor that has been taken into account with the specialization is the fact that the protein input spcript `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
|
|
|
Targeting only this case for the specialization can lead to simpler code, altough it would fall back on the original function when used with three parameters.
|
|
|
For more information about the two and three parameter invokation, check the LAMMPS [documentation](https://docs.lammps.org/pair_charmm.html#pair-style-lj-charmm-coul-long-command).
|
|
|
|
... | ... | @@ -69,15 +69,23 @@ Vectorization is based on SIMD processing (single instruction, multiple data), b |
|
|
With the RISC-V vector extension, this can be overcame with the help of masked instructions, which allows restricting writing the result of a vector instructions to only certain elements using a bitmask.
|
|
|
|
|
|
For instance, which proportion of the atom pair interactions (or inner loop iterations) belong to the *do nothing* group?
|
|
|
Even when using masked instructions to avoid updating *do nothing* interactions, instructions take some time to execute.
|
|
|
Even when using masked instructions, we can avoid updating data for *do nothing* interactions, but the execution time required for processing this data cannot be avoided.
|
|
|
So, as opposed to the serial version, a *do nothing* interaction has the same cost in time as any other atom in the vectorized version with masked instructions.
|
|
|
|
|
|
Before starting working on the vectorization, the code was modified to count the number of interactions that belong to each category.
|
|
|
The flowchart shows the average number number of interactions (for a single `i` atom in a timestep) that belong to each category, and the arrows show the same information in percentage form.
|
|
|
Black values show data for the default protein input, while red values correspond to the modified input described in section *Loop size*.
|
|
|
|
|
|
We can see how the proportion of "do nothing" elements in the regular input is about 42%.
|
|
|
We deemed to extract the not "do-nothing" elements would be too costly, since the proportion is too high, and the accelerator lacks the `vcompress` [instruction](https://github.com/riscv/riscv-v-spec/blob/0.7.1/v-spec.adoc#176-vector-compress-instruction) that implements this (see [ISA support](https://repo.hca.bsc.es/gitlab/EPI/RTL/Vector_Accelerator/-/wikis/VPU/ISA-support)).
|
|
|
For this reason, we decided to use the masking approach, even if it makes "do nothing" elements as slow as the rest.
|
|
|
|
|
|
This type of "masking" approach is not suitable for the elements labeled as "slow" (the ones involving `sqrt` and `exp`), since all elements would need a computation time of "slow" and "fast" combined.
|
|
|
The fact that there are so few "slow" elements (around 0.3%) makes it possible to try to use the "vextract" method.
|
|
|
Since the instruction is unavailable, we used a loop of `
|
|
|
The modified input manages to reduce the proportion of interactions that belong to the *do nothing* and *slow* categories.
|
|
|
It may be interesting to test how the performance of the modified input affects performance compared to the serial version.
|
|
|
It may be interesting to test how the modified input affects performance compared to the serial version.
|
|
|
|
|
|
|
|
|
### Managing 32-bit and 64 data types
|
|
|
|
... | ... | |