... | ... | @@ -131,19 +131,22 @@ It may be interesting to test how the performance of the modified input affects |
|
|
|
|
|
### Managing 32-bit and 64-bit data types
|
|
|
|
|
|
LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
|
|
|
This can become an issue for vectorization, specifically with the indexed memory instructions in 0.7.
|
|
|
|
|
|
To provide an example, in `i,j` interactions, the identifiers of atoms `j` are placed in `jlist`, an array of 32-bit integers.
These 32-bit integers are later used as indexes for accessing atom properties such as position (`**x`), which are stored in an array of 64-bit floating point numbers.
|
|
|
This access cannot be vectorized.
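The access pattern described above can be sketched in plain C++ as follows. This is a minimal illustration, not the LAMMPS source: the function name `sum_rsq` and the computation itself (a sum of squared distances) are hypothetical, and only the names `x` and `jlist` come from the text. It assumes a platform where pointers and `std::ptrdiff_t` are 64 bits wide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the inner loop of a pair style: sums the squared
// distances between atom i and each of its neighbors j listed in jlist.
double sum_rsq(double **x, int i, const int *jlist, int jnum) {
  const double xtmp = x[i][0];
  const double ytmp = x[i][1];
  const double ztmp = x[i][2];
  double total = 0.0;
  for (int jj = 0; jj < jnum; ++jj) {
    const int j = jlist[jj];  // 32-bit index read from jlist
    assert(j >= 0);
    // Addressing 64-bit data needs a pointer-sized offset, so the 32-bit
    // index is widened before it is used.
    const std::ptrdiff_t j_wide = static_cast<std::ptrdiff_t>(j);
    const double *xj = x[j_wide];      // first load: row pointer of atom j
    const double delx = xtmp - xj[0];  // following loads: 64-bit coordinates
    const double dely = ytmp - xj[1];
    const double delz = ztmp - xj[2];
    total += delx * delx + dely * dely + delz * delz;
  }
  return total;
}
```

In a vectorized version of this loop, the `jlist` elements and the position elements have different widths, so the indexes have to be widened before they can drive an indexed load of the 64-bit data.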
|
|
|
|
|
|
atom_vec.h -> contains `**x` and `**f` (3D)
|
|
|
neigh_list.h -> contains `**firstneigh` (for each i, store array of neighbors j)
|
|
|
PairLJCharmmCoulLong::settings ->
|
|
|
|
|
|
LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
|
|
|
This can become somewhat of an issue with the `jlist` array, which is an array of `int`s which act as indexes of `x` and
|
|
|
|
|
|
- **Specialization**: calls to subroutines inside compute only happen on the first and last iteration
|
|
|
- Data structures: which classes store the information about atoms (position, force), and how? Array of pointers to arrays
|
|
|
- **Structure of the code** - present the flowchart, show the different code paths: do-nothing, fast, slow...
|
|
|
- Abandoned idea: copied from the INTEL version - the "classify-loop" - store elements that can be computed vectorially in a buffer (in serial because 0.7) and then process them vectorially.
|
|
|
- Implemented idea: use a combination of masked operations for elements that do not need to be processed and a `vmfirst` loop to find the elements that need to be processed in serial
|
|
|
- **Problems**
|
|
|
- pointer to pointer: often requires two indexed load operations
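
The implemented idea listed above (masked operations combined with a find-first loop) can be emulated in scalar C++ to show the control flow. This is only a sketch under stated assumptions: `first_set`, `fast_op`, `slow_op`, `process`, the threshold-based classification and the chunk length `VL` are all hypothetical placeholders, and the real code would use vector masks and vector instructions instead of the inner scalar loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t VL = 8;  // assumed number of elements per vector chunk

// Hypothetical helper emulating a "find first set mask bit" operation:
// returns the index of the lowest set bit, or -1 if the mask is empty.
int first_set(std::uint64_t mask) {
  for (int k = 0; k < 64; ++k)
    if (mask & (std::uint64_t{1} << k)) return k;
  return -1;
}

double fast_op(double v) { return 2.0 * v; }  // placeholder for the fast path
double slow_op(double v) { return v * v; }    // placeholder for the slow path

// Negative inputs are skipped (do-nothing), inputs below `threshold` take the
// fast path under a mask, and the remaining inputs are handled one by one.
void process(const std::vector<double> &in, std::vector<double> &out,
             double threshold) {
  assert(in.size() == out.size());
  for (std::size_t base = 0; base < in.size(); base += VL) {
    const std::size_t n = std::min(VL, in.size() - base);
    std::uint64_t fast_mask = 0;
    std::uint64_t slow_mask = 0;
    // Classify each element of the chunk into one of the two masks.
    for (std::size_t k = 0; k < n; ++k) {
      const double v = in[base + k];
      if (v < 0.0) continue;  // do-nothing path: bit stays clear in both masks
      if (v < threshold) fast_mask |= std::uint64_t{1} << k;
      else               slow_mask |= std::uint64_t{1} << k;
    }
    // Masked step: stands in for one masked vector operation over the chunk.
    for (std::size_t k = 0; k < n; ++k)
      if (fast_mask & (std::uint64_t{1} << k))
        out[base + k] = fast_op(in[base + k]);
    // Serial step: repeatedly find the first pending element, process it,
    // and clear its bit until the mask is empty.
    for (int k = first_set(slow_mask); k >= 0; k = first_set(slow_mask)) {
      out[base + k] = slow_op(in[base + k]);
      slow_mask &= ~(std::uint64_t{1} << k);
    }
  }
}
```

When a chunk contains no slow elements, the serial loop exits after a single find-first check, so only the masked step costs anything for that chunk.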
|
... | ... | |