- Forward + backward pass takes about 100 ms
- Time per iteration is about 720 ms
![base-9-priority.code_legend](uploads/fd7dfb7130bb685aa20d6636fe20d339/base-9-priority.code_legend.png)

# Bottleneck
At this point, the bottleneck of this distributed version is the efficiency of the update phase. An update phase typically consists of two or three steps (a rough sketch in code follows the list):

- Sum all the distributed gradients
- Update the parameters using the summed gradients
- If necessary, broadcast the updated parameters to all ranks
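As a rough illustration of these three steps, here is a minimal sketch using torch.distributed; the use of PyTorch, the `update_phase` helper, and the Reduce-then-Broadcast split are assumptions made for the example, not necessarily how this code base implements it (the process group is assumed to be initialised elsewhere).

```python
import torch
import torch.distributed as dist

def update_phase(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Hypothetical update phase; assumes dist.init_process_group() was already called."""
    # Step 1: sum all the distributed gradients on rank 0.
    for p in model.parameters():
        if p.grad is not None:
            dist.reduce(p.grad, dst=0, op=dist.ReduceOp.SUM)

    # Step 2: update the parameters using the summed gradients.
    # Only rank 0 holds the full sum, so only rank 0 steps the optimizer.
    if dist.get_rank() == 0:
        optimizer.step()

    # Step 3: if necessary, share the updated parameters with all ranks.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
```

A single `all_reduce` on the gradients would merge steps 1 and 3 and remove the need for the final broadcast, at the cost of every rank running the optimizer step; the three-step version above simply mirrors the list.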
Since the transfer speed in and out of a node is limited to 10 Gb/s, we need to reduce as much as possible the communication going through any single node to avoid congestion. This is what happened in the example with 9 ranks. Let's investigate it:

With 9 ranks, we are using 5 nodes. At the end of the backward pass, we use a Reduce collective operation to sum all the gradients on one node. With a basic reduction strategy, this means that 4 nodes each send 1 Gb to a single node. Node 0 therefore receives a total of 4 Gb, which creates congestion. With an incoming transfer speed of 10 Gb/s per node, we can expect the transfer to take about 400 ms.
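That 400 ms estimate is just the incoming volume divided by the link speed; the small helper below (a hypothetical function, written only to restate the arithmetic from the paragraph above) makes the calculation explicit.

```python
def reduce_transfer_time_s(sending_nodes: int, gradient_size_gb: float, link_speed_gbps: float) -> float:
    """Lower bound on the transfer time of a naive Reduce, where every
    sending node pushes its full gradient to the same root node."""
    total_incoming_gb = sending_nodes * gradient_size_gb
    return total_incoming_gb / link_speed_gbps

# 9 ranks spread over 5 nodes: the 4 non-root nodes each send 1 Gb of
# gradients to node 0 over a 10 Gb/s link -> 4 Gb / 10 Gb/s = 0.4 s.
print(reduce_transfer_time_s(sending_nodes=4, gradient_size_gb=1.0, link_speed_gbps=10.0))  # 0.4
```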