... | ... | @@ -153,5 +153,5 @@ With B=16, T=1024, 1 rank/socket |
## Trace
Here we are using B=8, T=1024, 4 ranks, 1 rank/socket

The kernel that takes most of the runtime is Attention Key Value, as its runtime is proportional to T*T.

![oss-mpi](uploads/e5dfb6f721eddbd936c04349d50017bb/oss-mpi.png)![oss-mpi.code_legend](uploads/5255e10205e11138e63a9d6ee1ca0908/oss-mpi.code_legend.png)
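
To illustrate why an attention kernel scales with T*T, here is a minimal sketch that counts the multiply-adds needed for the pairwise query-key scores of one head and one batch element. It is not the traced kernel: the function name `attention_score_macs`, the head size of 64, and the causal-masking option are assumptions made for illustration only.

```python
def attention_score_macs(T, head_size, causal=True):
    """Count multiply-adds for the query-key dot products of a single head.

    Hypothetical helper for illustration; not taken from the traced code.
    """
    macs = 0
    for t in range(T):
        # With causal masking (assumed), position t only attends to 0..t.
        last = t + 1 if causal else T
        for _ in range(last):
            # One dot product of length head_size per (query, key) pair.
            macs += head_size
    return macs


if __name__ == "__main__":
    # head_size=64 is an assumed value, chosen only to make the numbers concrete.
    for T in (256, 512, 1024):
        print(T, attention_score_macs(T, head_size=64))
```

Doubling T roughly quadruples the count, with or without the causal mask, which is consistent with this kernel dominating the trace at T=1024.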