The input contains two files:
- `pair_style lj/charmm/coul/long 8.0 10.0`: this line specifies that we are using the *lj/charmm/coul/long* *pair_style*. If a different *pair_style* was selected, `PairLJCharmmCoulLong::compute` would not be executed and the optimizations would have no impact. Moreover, `8.0` and `10.0` are the *cutoff distances*, which can have an impact on the execution of the function. For more information, check the LAMMPS [documentation](https://docs.lammps.org/pair_charmm.html#pair-style-lj-charmm-coul-long-command).
- `data.protein`: contains the initial data for the atoms in the simulation and their properties
### Loop size
When vectorizing a loop, it is important to consider the number of loop iterations.
It may not be worth vectorizing a loop with a very small iteration count, since doing so can underuse the vector registers and may prove slower than the serial version.
`PairLJCharmmCoulLong::compute` deals with interactions between pairs of atoms `i,j`.
The outer loop traverses the 32000 `i` atoms in the protein input.
For each `i` atom, the inner loop traverses the atoms `j` that are neighbors of `i`.
Moreover, each `i` atom can have a different number of neighbors (`numneigh[i]`).
Our optimization targets the vectorization of the inner loop, so its iteration count determines whether the loop is worth vectorizing.
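As a rough illustration, the loop nest has the following shape (a simplified sketch following the usual LAMMPS neighbor-list layout, not the literal code; the pair computation inside the inner loop is omitted):

```cpp
// Simplified sketch of the loop structure in PairLJCharmmCoulLong::compute.
// The outer loop visits every i atom; the inner loop visits its neighbors j.
void compute_sketch(int inum, const int *ilist, const int *numneigh,
                    int **firstneigh) {
  for (int ii = 0; ii < inum; ii++) {   // outer loop: the i atoms
    int i = ilist[ii];
    int *jlist = firstneigh[i];         // neighbor list of atom i
    int jnum = numneigh[i];             // inner loop trip count

    for (int jj = 0; jj < jnum; jj++) { // inner loop: the vectorization target
      int j = jlist[jj];
      (void)j;                          // ... compute the i,j pair interaction ...
    }
  }
}
```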
After modifying the code to print the number of inner loop iterations, we found that on average, the inner loop contains 375 iterations.
Considering that registers in the 0.7 vector unit can hold up to 256 64-bit elements, the loop can be considered suitable for vectorization in terms of iteration count.
One can increase the number of iterations in the inner loop by increasing the neighbor distance threshold.
This neighbor threshold is set automatically according to the interaction distance thresholds specified in the `pair_style lj/charmm/coul/long X Y` command.
The largest interaction distance accepted by LAMMPS produced an average of 1290 inner loop iterations.
It may be interesting to run some tests to see how higher inner loop iteration counts affect performance compared with the serial version.
| input line | inner loop avg. iterations |
| ---------- | -------------------------- |
| `pair_style lj/charmm/coul/long 8.0 10.0` | 375 |
| `pair_style lj/charmm/coul/long 8.0 16.1` | 1290|
### Managing different code paths
In *Overview of Algorithm and Data structures* we presented the different code paths in the function.
It is important to consider carefully the implications of this when vectorizing.
Vectorization is based on SIMD processing (single instruction, multiple data), but different code paths require different instructions.
With the RISC-V vector extension, this can be overcome with the help of masked instructions, which allow restricting the write-back of a vector instruction's result to only certain elements, selected by a bitmask.
For instance, what proportion of the atom pair interactions (or inner loop iterations) belongs to the *do nothing* group?
When using masked instructions, we can avoid updating data for *do nothing* interactions, but the execution time required to process them cannot be avoided.
So, as opposed to the serial version, a *do nothing* interaction has the same cost in time as any other interaction in the vectorized version with masked instructions.
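The following scalar emulation illustrates this point (conceptual only; `expensive_pair_term`, `mask` and `force` are hypothetical placeholders, not names from the real kernel): the work is done for every lane, and the mask only gates the write-back.

```cpp
// Scalar emulation of how a masked vector update behaves (illustrative only).
static double expensive_pair_term(int lane) { return lane * 0.5; } // stand-in for the pair computation

void masked_update_sketch(int vl, const bool *mask, double *force) {
  for (int lane = 0; lane < vl; lane++) {
    double result = expensive_pair_term(lane); // computed for every lane,
                                               // "do nothing" lanes included
    if (mask[lane])                            // the mask only controls whether
      force[lane] += result;                   // the result is actually written
  }
}
```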
Before starting work on the vectorization, the code was modified to count the number of interactions that belong to each category.
The flowchart shows the average number of interactions (for a single `i` atom in a timestep) that belong to each category, and the arrows show the same information as percentages.
Black values show data for the default protein input, while red values correspond to the modified input described in section *Loop size*.
We can see that the proportion of "do nothing" elements in the regular input is about 42%.
We deemed that extracting the non-"do nothing" elements would be too costly, since their proportion is too high and the accelerator lacks the `vcompress` [instruction](https://github.com/riscv/riscv-v-spec/blob/0.7.1/v-spec.adoc#176-vector-compress-instruction) that implements this operation (see [ISA support](https://repo.hca.bsc.es/gitlab/EPI/RTL/Vector_Accelerator/-/wikis/VPU/ISA-support)).
For this reason, we decided to use the masking approach, even if it makes "do nothing" elements as slow as the rest.
This type of "masking" approach is not suitable for the elements labeled as "slow" (the ones involving `sqrt` and `exp`), since every element would then pay the computation time of the "slow" and "fast" paths combined.
The fact that there are so few "slow" elements (around 0.3%) makes it feasible to try the "vextract" method.
Since such an instruction is unavailable, we used a loop of `vmfirst` to locate the "slow" elements in the vector register and update them separately using the serial function `compute_iterj_special`.
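A scalar sketch of this "find first set bit and fix up" pattern (conceptual only; in the real code the mask lives in a vector mask register, `find_first` stands in for the `vmfirst` instruction, and the commented-out call is the serial routine mentioned above):

```cpp
#include <cstdint>

// Conceptual sketch of the vmfirst loop: the "slow" lanes are marked in a
// bitmask, and each one is located and recomputed serially. A 64-bit integer
// stands in for the vector mask register.
static int find_first(uint64_t mask) {
  if (mask == 0) return -1;             // vmfirst returns -1 when no element is set
  return __builtin_ctzll(mask);         // index of the lowest set bit
}

void fixup_slow_lanes(uint64_t slow_mask) {
  int lane;
  while ((lane = find_first(slow_mask)) != -1) {
    // compute_iterj_special(...);      // serial path with sqrt/exp for this lane
    slow_mask &= ~(UINT64_C(1) << lane); // clear the bit and look for the next one
  }
}
```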
The modified input manages to reduce the proportion of interactions that belong to the *do nothing* and *slow* categories.
It may be interesting to test how the modified input affects performance in both serial and vectorized versions.
### Managing 32-bit and 64-bit data types
LAMMPS uses 64-bit `double` precision numbers for floating point calculations and 32-bit `int` numbers for integer computations.
This can become somewhat of an issue with the vectorization, specifically with indexed memory instructions in 0.7.
To provide an example, in `i,j` interactions, the identifiers of the neighbor atoms `j` can be found in an array of 32-bit integers, `jlist`.
These 32-bit integers are later used as indices for accessing atom properties such as the positions (`**x`), which are stored in an array of 64-bit floating point numbers.
This access cannot be vectorized using an indexed load instruction, since that instruction requires the result and the index vector to have the same element width (SEW).
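In scalar form, the problematic access pattern looks roughly like this (a simplified sketch; in LAMMPS `x` is the per-atom position array and `jlist` the neighbor list, while `xj`, `yj`, `zj` are hypothetical destination buffers):

```cpp
// Simplified scalar form of the access we would like to turn into an indexed
// (gather) load: 32-bit indices from jlist select 64-bit doubles from x.
void gather_positions_sketch(int jnum, const int *jlist, double **x,
                             double *xj, double *yj, double *zj) {
  for (int jj = 0; jj < jnum; jj++) {
    int j = jlist[jj];   // 32-bit index ...
    xj[jj] = x[j][0];    // ... used to address 64-bit data: the index vector
    yj[jj] = x[j][1];    // and the loaded elements would need different SEW
    zj[jj] = x[j][2];    // in a vector gather
  }
}
```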
With the 1.0 VPU, this could be solved easily with the use of fixed size loads that do not depend on the internal SEW setting.
A fixed-size `vle32.v` load with an internal SEW of 64 bits would be able to extend 32-bit integers to 64 bits.
In the 0.7 VPU, this type of load is not implemented.
On the other hand, the `vwadd` intrinsic allows widening from SEW to 2*SEW (from 32 to 64 bits), although it has several limitations.
First, `vwadd` is designed to work with LMUL, since it would convert the register from `__epi_2xi32` to `__epi_2xi64`, and the latter data type needs LMUL.
Since the 0.7 VPU does not support LMUL, `vwadd` is implemented differently there: the 32-bit input (`__epi_2xi32`) is widened into a 64-bit result that is written across two consecutive registers.
This is the source of a conflict, since the intrinsics are not aware of this particular implementation of `vwadd` in the VPU and instead adhere to the standard definition that uses LMUL.
For this reason, using `vwadd` with intrinsics in the 0.7 VPU is not possible since just using a `__epi_2xi64` data type triggers an error because of the lack of LMUL support.
The proposed solution was to try to use `vwadd` with inline assembler, but that was deemed too inconsistent.
In the VPU implementation of this instruction, the output is written to two consecutive registers, although only one of them is specified as an operand.
This can lead to compilation errors, since the inline assembler is not aware of this and may automatically choose a combination of input and output registers that overlap, generating an error when the instructions are assembled.
Sometimes the compiler produces this error, but changing the optimization setting (from -O0 to -O2) can make it disappear, since a different combination of registers may be chosen that happens not to overlap.
For this reason, this approach has been discarded.
In the end, a bithack trick was used to extend the array of 32-bit integers into a vector register with a SEW of 64 bits.
The trick is to load the 32-bit array into a register with 64-bit SEW, and then use a `vand` operation to blank the most significant half of each element, mimicking a zero extension.
To get the other half of the values, a logical shift right is performed before applying the `vand`.
In addition, this method requires some extra handling for the case in which the array has an odd number of elements.
The following figure shows a representation of the operations needed for the 32-bit to 64-bit conversion.
To see the code in detail, check annex (TODO).
![evenodd](uploads/82a21f4b6d4321ea2481d04fa9818f9e/evenodd.png)
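A scalar model of the same trick (illustrative only; it assumes a little-endian layout, which is what loading the 32-bit array with a 64-bit SEW effectively produces): each 64-bit element initially packs two consecutive 32-bit values, and the even/odd halves are recovered with a mask and a logical shift.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of the 32-bit -> 64-bit extension trick (illustrative only).
// Each 64-bit word packs two consecutive 32-bit integers; an AND with
// 0xFFFFFFFF recovers the even-indexed values, a shift right by 32 the odd ones.
void extend_32_to_64_sketch(const uint32_t *src, int n,
                            uint64_t *even, uint64_t *odd) {
  for (int k = 0; k + 1 < n; k += 2) {
    uint64_t packed;
    std::memcpy(&packed, &src[k], sizeof packed); // two 32-bit values in one word
    even[k / 2] = packed & 0xFFFFFFFFull;         // vand: keep src[k] (blank the top half)
    odd[k / 2]  = packed >> 32;                   // srl by 32: recover src[k+1]
  }
  if (n % 2)                                      // odd element count: handle the tail
    even[n / 2] = src[n - 1];                     // extend the last element separately
}
```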
It is important to check whether an unaligned memory access exception can be produced by vector loads.
For instance, to place a 32-bit array inside a 64-bit register, performing a load with a SEW of 64 produces an unaligned access exception if the starting address is not suitably aligned.
For this reason, the memory load needs to be performed with a SEW of 32, and the contents then moved to a 64-bit register using `vmv.v.v` (as with TODO union_int_float_t in section X).
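To make the constraint explicit, a trivial check of the condition that can trigger the exception (a 32-bit array is only guaranteed to be 4-byte aligned, so a 64-bit SEW load may start at a non-8-byte-aligned address):

```cpp
#include <cstdint>

// A 64-bit SEW load requires the starting address to be 8-byte aligned;
// a plain int32 array only guarantees 4-byte alignment.
bool is_safe_for_sew64(const uint32_t *ptr) {
  return reinterpret_cast<uintptr_t>(ptr) % 8 == 0;
}
```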
### Handling union_int_float_t
`PairLJCharmmCoulLong::compute` defines a union type, `union_int_float_t`, in order to operate on the bits of a 32-bit floating point number `rsq` as if it were a 32-bit integer.
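The union is along these lines (a sketch; see the LAMMPS sources for the exact definition), letting the same 32 bits be viewed either as a `float` or as an `int`:

```cpp
// Sketch of the union used by the pair style: the same 32 bits viewed
// either as an integer or as a single-precision float.
typedef union {
  int i;
  float f;
} union_int_float_t;

// Typical use: store the squared distance as a float, then manipulate
// its bit pattern as an integer.
// union_int_float_t rsq_lookup;
// rsq_lookup.f = rsq;       // write the float
// int bits = rsq_lookup.i;  // read the same bits back as an int
```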