... | ... | @@ -51,7 +51,7 @@ These values always follow: |
|
|
- `cut_bothsq = MIN(cut_ljsq, cut_coulsq)`
|
|
|
|
|
|
In the code, `rsq` represents the distance between atoms `i,j`. It is saved in squared form to avoid computing an expensive `sqrt`.
|
|
|
- If `rsq` is larger than `cut_bothsq`, then, no computation is required because there is no short-range interaction between the two atoms. In that case, the inner loop iteration stops here.
|
|
|
- If `rsq` is larger than `cut_bothsq`, then, no computation is required because there is no short-range interaction between the two atoms. In that case, the inner loop iteration stops here (*do nothing*).
|
|
|
- If `rsq` is smaller than `tabinnersq`, then `forcecoul` is computed using a *fast* table method. If not, it is computed using a *slow* method with calls to `sqrt` and `exp` functions.
|
|
|
- If `rsq` is bigger than `cut_lj_innersq`, then `forcelj` needs a few additional computations.
|
|
|
|
... | ... | @@ -86,8 +86,48 @@ The specialized function is called when possible, and if not the execution falls |
|
|
Another factor that has been taken into account with the specialization is the fact that the protein input spcript `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
|
|
|
Targeting only this case for the specialization this case for the optimization can lead to simpler code.
|
|
|
|
|
|
### Loop size
|
|
|
|
|
|
When vectorizing a loop, it is important to consider the number of loop iterations.
|
|
|
It may not be worth vectorizing a loop that features a very small iteration count with underusing of the vector registers, since it may prove slower than the serial version.
|
|
|
|
|
|
`PairLJCharmmCoulLong::compute` deals with interactions of pair of atoms `i,j`.
|
|
|
The outer loop that traverses through the 32000 `i` atoms in the protein input.
|
|
|
For each `i` atom, the inner loop traverses through atoms `j` that are neighbors of `i`.
|
|
|
Moreover, each `i` atom can have a different amount of neighbors (`numneigh[i]`).
|
|
|
|
|
|
Our optimization targets the vectorization of the inner loop, so its iteration count will determine if the loop is worth vectorizing.
|
|
|
After modifying the code to print the number of inner loop iterations, we found that on average, the inner loop contains 375 iterations.
|
|
|
Considering that registers in the 0.7 vector unit can hold up to 256 64 bit elements, the loop is suitable for vectorizing iteration count wise.
|
|
|
|
|
|
One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.
|
|
|
This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.
|
|
|
The largest interaction distance accepted by LAMMPS produced an average of 1290 inner loop iterations.
|
|
|
It may be interesting to do some tests to see if the performance improves with higher inner loop iteration counts.
|
|
|
|
|
|
| input line | inner loop avg. iterations |
|
|
|
| ---------- | -------------------------- |
|
|
|
| `pair_style lj/charmm/coul/long 8.0 10.0` | 375 |
|
|
|
| `pair_style lj/charmm/coul/long 8.0 16.1` | 1290|
|
|
|
|
|
|
|
|
|
### Managing different code paths
|
|
|
|
|
|
In *Overview of Algorithm and Data structures* we presented the different code paths in the function.
|
|
|
It is important to consider carefully the implications of this when vectorizing.
|
|
|
Vectorization is based on SIMD processing (single instruction, multiple data), but different code paths require different instructions.
|
|
|
With the RISC-V vector extension, this can be overcame with the help of masked instructions, which allows restricting writing the result of a vector instructions to only certain elements using a bitmask.
|
|
|
|
|
|
For instance, which proportion of the atom pairs belong to the *do nothing* group?
|
|
|
Even when using masked instructions to avoid updating *do nothing* atoms, instructions take some time to execute.
|
|
|
So, as opposed to the serial version, a *do nothing* atom has the same cost in time as any other atom in the vectorized version with masked instructions.
|
|
|
|
|
|
Before starting working on the vectorization, the code de was modified to count the number of atoms that belong to each category.
|
|
|
The flowchart shows in black, the average number number of
|
|
|
|
|
|
|
|
|
### Managing 32 bit and 64 data types
|
|
|
|
|
|
atom_vec.h -> contains `**x` and `**f` (3D)
|
|
|
neigh_list.h -> contains `**firstneigh` (for each i, store array of neighbors j)
|
|
|
PairLJCharmmCoulLong::settings ->
|
... | ... | |