... | @@ -81,11 +81,11 @@ We deemed to extract the not "do-nothing" elements would be too costly, since th |
|
For this reason, we decided to use the masking approach, even if it makes "do nothing" elements as slow as the rest.
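The cost trade-off of masking can be illustrated with a minimal scalar sketch. It assumes a hypothetical per-element kernel `fast_update` and a hypothetical `kind` array that labels each element; neither name comes from the original code. Every lane evaluates the kernel, and the mask only decides which results are kept, which is why "do nothing" elements end up costing as much as the others.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the element categories (not from the original code). */
enum { DO_NOTHING = 0, FAST = 1 };

/* Hypothetical stand-in for the real "fast" per-element update. */
static double fast_update(double x) { return 2.0 * x + 1.0; }

/* Masked update: the kernel runs on every lane; the mask only selects
 * whether the computed value or the old value is written back. */
void masked_fast_update(double *v, const int *kind, size_t n)
{
    assert(v != NULL && kind != NULL);
    for (size_t i = 0; i < n; ++i) {
        double candidate = fast_update(v[i]);          /* paid by all lanes */
        v[i] = (kind[i] == FAST) ? candidate : v[i];   /* mask-select */
    }
}
```

In a real vector implementation the loop body would be a single masked or merged vector operation, but the cost argument is the same: the work is proportional to the vector length, not to the number of active lanes.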
|
This masking approach is not suitable for the elements labeled "slow" (the ones involving `sqrt` and `exp`), since every element would then pay the combined computation time of the "slow" and "fast" paths.
|
The fact that there are so few "slow" elements (around 0.3%) makes it possible to try to use the "vextract" method.
|
|
The fact that there are so few "slow" elements (around 0.3%) makes it feasible to try to use the "vextract" method.
|
|
Since the instruction is unavailable, we used a loop of `
|
|
Since the instruction is unavailable, we used a loop of `vmfirst` to mask the "slow" elements in the vector register and update them separately with the serial function `compute_iterj_special`.
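The loop described above can be sketched in plain C. This is a scalar emulation under stated assumptions, not the actual implementation: `mask_first` is a hypothetical helper that mimics a find-first-set-mask-bit step (returning -1 when no bit is set), the mask is modeled as a 64-bit integer, and `special_update` is a hypothetical stand-in for `compute_iterj_special`, whose real signature is not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper emulating a find-first-set-mask-bit step:
 * index of the lowest set bit among the first vl bits, or -1 if none. */
static int mask_first(uint64_t mask, int vl)
{
    assert(vl >= 0 && vl <= 64);
    for (int i = 0; i < vl; ++i)
        if ((mask >> i) & 1u)
            return i;
    return -1;
}

/* Hypothetical stand-in for the serial routine compute_iterj_special. */
static double special_update(double x) { return 0.5 * x; }

/* Visit each "slow" lane in turn: find it, update it serially,
 * clear its mask bit, and repeat until the mask is empty.
 * Returns the number of lanes handled. */
int update_slow_lanes(double *lanes, uint64_t slow_mask, int vl)
{
    int handled = 0;
    int idx;
    while ((idx = mask_first(slow_mask, vl)) >= 0) {
        lanes[idx] = special_update(lanes[idx]);
        slow_mask &= ~(UINT64_C(1) << idx);   /* this lane is done */
        ++handled;
    }
    return handled;
}
```

The number of loop iterations equals the number of "slow" lanes, so with only around 0.3% of elements in that category most vectors should need zero or one iteration.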
|
|
The modified input manages to reduce the proportion of interactions that belong to the *do nothing* and *slow* categories.
|
|
|
|
It may be interesting to test how the modified input affects performance compared to the serial version.
|
It may be interesting to test how the modified input affects performance in both serial and vectorized versions.
|
|
|
|
|
|
### Managing 32-bit and 64-bit data types
|
|
|
|
|
... | | ... | |