... | @@ -108,23 +108,12 @@ Comparison of the update phase, without considering task overlapping. We will me |
| Strategy 3 | Not tested | Not tested |

Note that strategy 3 is the one used in the GPU version. However, strategy 2 is much simpler to use with the task-based implementation.

From now on, we will use strategy 2.
# Multi-reduce + broadcast communications
# Benchmarking
Some traces, with 4 ranks, 1 rank/socket, B=4, T=1024:
## Increasing the number of ranks
Communication cost as a function of the number of ranks (1 rank/socket, B=4*worldsize, T=1024):

| Number of ranks | Communication time from last to first | Communication time from first to last |
... | @@ -137,8 +126,6 @@ Communication cost with number of ranks (1rank/socket, B=4*worldsize, T=1024) : |
As you can see, the communication time is best with 4 ranks. This is because the communication block sizes can be adjusted. Currently, the sizes have been manually tuned for 4 ranks, which explains why the best communication times are obtained with 4 ranks.
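
To illustrate what a communication block size controls, here is a minimal sketch, assuming the data to communicate is a flat buffer cut into fixed-size blocks. The function `split_into_blocks` and the sizes used are hypothetical; the actual block layout of the implementation is not shown in this document.

```python
def split_into_blocks(n_elems: int, block_size: int) -> list[tuple[int, int]]:
    """Cut a flat buffer of n_elems elements into (offset, length) blocks.

    Hypothetical helper for illustration only: every block has block_size
    elements, except possibly the last one, which holds the remainder.
    """
    if n_elems < 0 or block_size <= 0:
        raise ValueError("n_elems must be >= 0 and block_size must be > 0")
    return [
        (offset, min(block_size, n_elems - offset))
        for offset in range(0, n_elems, block_size)
    ]


if __name__ == "__main__":
    # Illustrative sizes, not taken from the benchmark.
    for offset, length in split_into_blocks(10, 4):
        print(f"block at offset {offset}, length {length}")
```

A smaller block size gives more, shorter messages; a larger one gives fewer, longer messages. A value tuned by hand for one rank count is therefore not necessarily the best one for another.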
# Increasing the number of ranks
With B=16, T=1024, 1 rank/socket:

| Number of ranks | time/iteration | tokens/s | tokens/(s.cpus) |
... | @@ -150,7 +137,7 @@ With B=16, T=1024, 1 rank/socket |
| 16 | 4084 ms | 4011 | 4.48 |
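
The throughput columns can be derived from the time per iteration. The sketch below assumes that B is the global batch size, so that one iteration processes B*T tokens; this assumption reproduces the 16-rank row above. The function names are illustrative, and the value of 56 cores per socket is an assumption that is not stated in this document: it is used only because it is consistent with the 4.48 figure.

```python
def tokens_per_second(batch_size: int, seq_len: int, seconds_per_iter: float) -> float:
    """Tokens processed per second, assuming batch_size * seq_len tokens per iteration."""
    if seconds_per_iter <= 0:
        raise ValueError("seconds_per_iter must be positive")
    return batch_size * seq_len / seconds_per_iter


def tokens_per_second_per_cpu(tps: float, n_ranks: int, cores_per_rank: int) -> float:
    """Throughput normalised by the total number of CPU cores in use."""
    if n_ranks <= 0 or cores_per_rank <= 0:
        raise ValueError("n_ranks and cores_per_rank must be positive")
    return tps / (n_ranks * cores_per_rank)


if __name__ == "__main__":
    # Values from the 16-rank row: B=16, T=1024, 4084 ms per iteration.
    tps = tokens_per_second(16, 1024, 4.084)
    # ASSUMPTION: 56 cores per socket (1 rank/socket); not given in the text.
    per_cpu = tokens_per_second_per_cpu(tps, n_ranks=16, cores_per_rank=56)
    print(f"{int(tps)} tokens/s, {per_cpu:.2f} tokens/(s.cpus)")
```

Normalising by the core count makes rows with different numbers of ranks comparable: if scaling were perfect, tokens/(s.cpus) would stay constant as ranks are added.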
# Increasing the batch size
## Increasing the batch size
4 ranks, 1 rank/socket, T=1024
... | | ... | |