... | @@ -22,7 +22,7 @@ In this section we discuss the implementation of the optimized `PairLJCharmmCoul |
... | @@ -22,7 +22,7 @@ In this section we discuss the implementation of the optimized `PairLJCharmmCoul |
|
|
|
|
|
### Specialization
|
|
### Specialization
|
|
|
|
|
|
The end of `PairLJCharmmCoulLong::compute` contains function calls to `ev_tally` or `virial_fdotr_compute`, which are run depending on the value or `evflag` and `vflag_fdotr` variables.
|
|
The end of `PairLJCharmmCoulLong::compute` contains function calls to `ev_tally` or `virial_fdotr_compute`. Whether these are run or not depends on the values of the `evflag` and `vflag_fdotr` variables.
|
|
An analysis using GDB breakpoints showed that, for the protein input, these functions are only called on the first and last timesteps of the execution.
|
|
An analysis using GDB breakpoints showed that, for the protein input, these functions are only called on the first and last timesteps of the execution.
|
|
This Paraver trace shows the weight of the first and last iterations compared to the rest.
|
|
This Paraver trace shows the weight of the first and last iterations compared to the rest.
|
|
|
|
|
... | @@ -33,17 +33,18 @@ Having function calls inside the function to be optimized can be troublesome bec |
... | @@ -33,17 +33,18 @@ Having function calls inside the function to be optimized can be troublesome bec |
|
2. If not autovectorized, the function would need to be vectorized with intrinsics, with all the additional work.
|
|
2. If not autovectorized, the function would need to be vectorized with intrinsics, with all the additional work.
|
|
3. If kept serial, then a mechanism to unpack data from the vector registers would still be needed.
|
|
3. If kept serial, then a mechanism to unpack data from the vector registers would still be needed.
|
|
|
|
|
|
After considering this, we decided that the specialized function should only target the case in which the functions are not called.
|
|
After considering this analysis, we decided to write a specialized routine that only targets the case in which the functions are not called.
|
|
Now, there are two funcions, `compute_loopi_original` and `compute_loopi_special`.
|
|
Now, there are two routines, `compute_loopi_original` and `compute_loopi_special`.
|
|
The specialized function is called when possible, and if not the execution falls back on the original function.
|
|
The specialized routine is called when possible; otherwise, execution falls back on the original routine.
|
|
|
|
|
|
Another factor taken into account in the specialization is that the protein input script `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
|
|
Another factor taken into account in the specialization is that the protein input script `in.protein` uses the form `pair_style lj/charmm/coul/long X Y` with only two parameters, which implies that `cut_ljsq = cut_coulsq`.
|
|
Targeting only this case for the specialization this case for the optimization can lead to simpler code.
|
|
Targeting only this case for the specialization can lead to simpler code, although execution would fall back on the original function when the command is used with three parameters.
|
|
|
|
For more information about the two- and three-parameter invocations, check the LAMMPS [documentation](https://docs.lammps.org/pair_charmm.html#pair-style-lj-charmm-coul-long-command).
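
The selection between the two routines described above can be sketched roughly as follows. This is a simplified illustration, not the code in the repository: the `ComputeState` struct, the `Kernel` enum, and `select_kernel` are hypothetical names introduced here, and only `evflag`, `vflag_fdotr`, `cut_ljsq`, and `cut_coulsq` come from the text.

```cpp
// Hypothetical sketch of the dispatch logic; names other than the four
// fields below are assumptions made for illustration.
enum class Kernel { Original, Special };

struct ComputeState {
  int evflag;         // nonzero when ev_tally has to be called
  int vflag_fdotr;    // nonzero when virial_fdotr_compute has to be called
  double cut_ljsq;    // squared Lennard-Jones cutoff
  double cut_coulsq;  // squared Coulomb cutoff
};

// Pick the specialized routine only when neither tally function is needed
// and both cutoffs coincide (the two-parameter pair_style form).
Kernel select_kernel(const ComputeState &s) {
  const bool no_tally = (s.evflag == 0) && (s.vflag_fdotr == 0);
  const bool same_cutoff = (s.cut_ljsq == s.cut_coulsq);
  return (no_tally && same_cutoff) ? Kernel::Special : Kernel::Original;
}
```

Under these assumptions, a caller would invoke `compute_loopi_special` when `select_kernel` returns `Kernel::Special` and `compute_loopi_original` in every other case, so the first and last timesteps and any three-parameter input still take the original path.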
|
|
|
|
|
|
### Loop size
|
|
### Loop size
|
|
|
|
|
|
When vectorizing a loop, it is important to consider the number of loop iterations.
|
|
When vectorizing a loop, it is important to consider the number of loop iterations.
|
|
It may not be worth vectorizing a loop that features a very small iteration count with underusing of the vector registers, since it may prove slower than the serial version.
|
|
It may not be worth vectorizing a loop with a very small iteration count, since it can underuse the vector registers and may prove slower than the serial version.
|
|
|
|
|
|
`PairLJCharmmCoulLong::compute` deals with interactions between pairs of atoms `i,j`.
|
|
`PairLJCharmmCoulLong::compute` deals with interactions between pairs of atoms `i,j`.
|
|
The outer loop traverses the 32000 `i` atoms in the protein input.
|
|
The outer loop traverses the 32000 `i` atoms in the protein input.
|
... | @@ -52,7 +53,7 @@ Moreover, each `i` atom can have a different amount of neighbors (`numneigh[i]`) |
... | @@ -52,7 +53,7 @@ Moreover, each `i` atom can have a different amount of neighbors (`numneigh[i]`) |
|
|
|
|
|
Our optimization targets the vectorization of the inner loop, so its iteration count will determine if the loop is worth vectorizing.
|
|
Our optimization targets the vectorization of the inner loop, so its iteration count will determine if the loop is worth vectorizing.
|
|
After modifying the code to print the number of inner loop iterations, we found that on average, the inner loop contains 375 iterations.
|
|
After modifying the code to print the number of inner loop iterations, we found that on average, the inner loop contains 375 iterations.
|
|
Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop is suitable for vectorizing iteration count wise.
|
|
Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop can be considered suitable for vectorization in terms of iteration count.
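
To make the iteration-count argument concrete, the arithmetic can be sketched as below. This assumes a plain strip-mined loop in which every strip except the last fills the whole register; the helper names `strips_needed` and `tail_length` are hypothetical and are not part of LAMMPS or of the optimized code.

```cpp
#include <cstddef>

// Number of vector strips needed to cover n iterations when one register
// holds at most vlmax elements (ceiling division).
constexpr std::size_t strips_needed(std::size_t n, std::size_t vlmax) {
  return (n + vlmax - 1) / vlmax;
}

// Number of elements handled by the last strip; all earlier strips are
// assumed to be full.
constexpr std::size_t tail_length(std::size_t n, std::size_t vlmax) {
  return n == 0 ? 0 : n - (strips_needed(n, vlmax) - 1) * vlmax;
}
```

With the average of 375 iterations and 256 elements per register, `strips_needed(375, 256)` gives 2 and `tail_length(375, 256)` gives 119, so an average inner loop would run one full strip followed by a partially filled one.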
|
|
|
|
|
|
One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.
|
|
One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.
|
|
This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.
|
|
This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.
|
... | | ... | |