|
|
|
|
|
![all-nine](uploads/2d471a1e2cf8dc0bf30f957a4554d9e5/all-nine.png)
|
|
|
|
|
|
|
|
|
- Broadcast takes about 115 ms
|
|
|
- Reduce takes about 500 ms
|
|
|
- Forward + backward pass takes about 100 ms
|
|
|
- Time per iteration is about 720 ms
|
|
|
|
At this point, the bottleneck of this distributed version is to have an efficient update of the parameters:
|
|
- If necessary, share the updated parameters to all ranks
|
|
|
|
|
|
As the transfer speed limit in and out of a node is 10 Gb/s, we need to minimize the amount of communication concentrated on a single node to avoid congestion. This congestion is exactly what happened in the run with 9 ranks. Let's investigate it:
|
|
|
|
|
With 9 ranks, we are using 5 nodes. At the end of the backward pass, we use a Reduce collective operation to sum all the gradients on one node. With a basic reduction strategy, this means that 4 nodes each send 1 Gb to a single node, so node 0 receives a total of 4 Gb, which creates congestion. With an incoming transfer speed of 10 Gb/s per node, we can expect this transfer to take 400 ms. Moreover, each node needs to send 2 Gb (2 ranks per node), so, still assuming a simple reduction strategy, sending those 2 Gb takes 200 ms per node. Taking the pipeline into account as well, we can understand why the Reduce operation took about 500 ms in the last example with 9 ranks. The same reasoning applies to the Broadcast operation.
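
To make the estimate explicit, here is the back-of-the-envelope arithmetic, under the assumption that the 10 Gb/s limit applies separately to the incoming and outgoing traffic of a node:

```math
t_{\text{recv}} = \frac{4 \times 1\ \text{Gb}}{10\ \text{Gb/s}} = 400\ \text{ms}
\qquad\qquad
t_{\text{send}} = \frac{2\ \text{Gb}}{10\ \text{Gb/s}} = 200\ \text{ms}
```

Since the incoming and outgoing transfers only partially overlap, an overall Reduce time of roughly 500 ms is consistent with these numbers.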
|
|
|
|
|
|
We therefore need to find more efficient ways to execute this update phase. The strategies we can consider are:
|
|
|
|
|
|
**Strategy 1 - Server node + worker nodes:**
|
|
|
- MPI_Reduce(MPI_SUM) to rank 0
|
|
|
- Update on rank 0
|
|
|
- MPI_Bcast from rank 0
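
A minimal sketch of Strategy 1, assuming a plain SGD update on a single float gradient/parameter array; the array size, learning rate and variable names are illustrative assumptions, not taken from the actual code:

```c
/* Strategy 1 sketch: reduce on rank 0, update there, broadcast back. */
#include <mpi.h>
#include <stdlib.h>

#define N_PARAMS (1 << 20)  /* assumed parameter count, for illustration */
#define LR 0.01f            /* assumed learning rate */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *params = calloc(N_PARAMS, sizeof(float));
    float *grads  = calloc(N_PARAMS, sizeof(float));  /* local gradients from the backward pass */
    float *summed = calloc(N_PARAMS, sizeof(float));  /* only meaningful on rank 0 */

    /* 1. Sum the gradients of all ranks onto rank 0. */
    MPI_Reduce(grads, summed, N_PARAMS, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* 2. Rank 0 updates the parameters. */
    if (rank == 0)
        for (int i = 0; i < N_PARAMS; i++)
            params[i] -= LR * summed[i];

    /* 3. Rank 0 broadcasts the updated parameters to every rank. */
    MPI_Bcast(params, N_PARAMS, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(params); free(grads); free(summed);
    MPI_Finalize();
    return 0;
}
```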
|
|
|
|
|
|
**Strategy 2 - Worker nodes only, Sliced Reduce & Broadcast:**
|
|
|
- Slice the gradients into nb_ranks slices and apply an MPI_Reduce(MPI_SUM) to each slice, with a different root rank each time
|
|
|
- Update a slice of parameters on each rank
|
|
|
- Each rank broadcasts its own slice of updated parameters to all other ranks
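
A minimal sketch of Strategy 2, under the same assumptions as the previous sketch plus the assumption that the parameter count divides evenly by the number of ranks:

```c
/* Strategy 2 sketch: sliced Reduce (one root per slice), local update, sliced Bcast. */
#include <mpi.h>
#include <stdlib.h>

#define N_PARAMS (1 << 20)  /* assumed parameter count, for illustration */
#define LR 0.01f            /* assumed learning rate */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nb_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nb_ranks);

    int slice = N_PARAMS / nb_ranks;  /* assumes an even split */
    float *params = calloc(N_PARAMS, sizeof(float));
    float *grads  = calloc(N_PARAMS, sizeof(float));
    float *summed = calloc(slice,    sizeof(float));  /* summed gradients of this rank's slice */

    /* 1. Reduce each slice of the gradients onto a different root rank. */
    for (int root = 0; root < nb_ranks; root++)
        MPI_Reduce(grads + root * slice, rank == root ? summed : NULL,
                   slice, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

    /* 2. Each rank updates its own slice of the parameters. */
    for (int i = 0; i < slice; i++)
        params[rank * slice + i] -= LR * summed[i];

    /* 3. Each rank broadcasts its updated slice to all other ranks. */
    for (int root = 0; root < nb_ranks; root++)
        MPI_Bcast(params + root * slice, slice, MPI_FLOAT, root, MPI_COMM_WORLD);

    free(params); free(grads); free(summed);
    MPI_Finalize();
    return 0;
}
```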
|
|
|
|
|
|
**Strategy 3 - Worker nodes only, Reduce_scatter & update:**
|
|
|
- Use MPI_Reduce_scatter_block(MPI_SUM) to sum the gradients and scatter the result in blocks across all ranks
|
|
|
- Update the corresponding block of parameters on each rank
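
A minimal sketch of one possible reading of Strategy 3, under the same assumptions. Since MPI_Reduce_scatter_block leaves each rank with only its own block of the summed gradients, the sketch updates just that block on each rank; sharing the updated blocks afterwards (the "if necessary" step mentioned earlier) is not shown:

```c
/* Strategy 3 sketch: one Reduce_scatter_block call sums the gradients and
 * scatters the result, then each rank updates the block it received. */
#include <mpi.h>
#include <stdlib.h>

#define N_PARAMS (1 << 20)  /* assumed parameter count, for illustration */
#define LR 0.01f            /* assumed learning rate */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nb_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nb_ranks);

    int block = N_PARAMS / nb_ranks;  /* assumes an even split */
    float *params = calloc(N_PARAMS, sizeof(float));
    float *grads  = calloc(N_PARAMS, sizeof(float));
    float *summed = calloc(block,    sizeof(float));  /* this rank's block of the sum */

    /* 1. Sum and scatter in a single call: rank r receives elements
     *    [r*block, (r+1)*block) of the summed gradients. */
    MPI_Reduce_scatter_block(grads, summed, block, MPI_FLOAT, MPI_SUM,
                             MPI_COMM_WORLD);

    /* 2. Each rank updates the block of parameters matching its block of gradients. */
    for (int i = 0; i < block; i++)
        params[rank * block + i] -= LR * summed[i];

    free(params); free(grads); free(summed);
    MPI_Finalize();
    return 0;
}
```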
|
|
|
|
|
|
|
|
|
With strategy 2, we can avoid transfer congestion on a single node by transferring the same amount of data to and from each node. This way, we also leverage the bi-directional links between nodes on MN5.
|
|
|
For strategy 3, we can hope that, by issuing fewer MPI communications, we reduce the total time taken by this update phase. We can also hope that MPI optimizes its combined reduce-and-scatter communication well, which could also reduce transfer congestion.
|
|
|
|
|
|
Finally, for all these strategies, we can also think about moving from the classic MPI_SUM operation to a user-defined one, which would leverage OmpSs-2 or OpenMP to parallelize the reduction computation.
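
As a sketch of that idea, the user-defined operation below reproduces MPI_SUM on float buffers but parallelizes the element-wise combine step with OpenMP (the datatype, buffer size and names are assumptions made for illustration; an OmpSs-2 version would replace the pragma with a task-based loop):

```c
/* Sketch: user-defined reduction operation with an OpenMP-parallel combine step. */
#include <mpi.h>
#include <stdlib.h>

#define N_PARAMS (1 << 20)  /* assumed gradient size, for illustration */

/* Element-wise sum of two float buffers, split across OpenMP threads. */
static void omp_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
    (void)dtype;  /* this sketch assumes MPI_FLOAT buffers */
    float *in = invec, *inout = inoutvec;
    #pragma omp parallel for
    for (int i = 0; i < *len; i++)
        inout[i] += in[i];
}

int main(int argc, char **argv)
{
    /* FUNNELED is enough: the OpenMP threads never call MPI themselves. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    float *grads  = calloc(N_PARAMS, sizeof(float));
    float *summed = calloc(N_PARAMS, sizeof(float));

    MPI_Op omp_sum_op;
    MPI_Op_create(omp_sum, 1 /* commutative */, &omp_sum_op);

    /* Drop-in replacement for MPI_SUM in any of the strategies above. */
    MPI_Reduce(grads, summed, N_PARAMS, MPI_FLOAT, omp_sum_op, 0, MPI_COMM_WORLD);

    MPI_Op_free(&omp_sum_op);
    free(grads); free(summed);
    MPI_Finalize();
    return 0;
}
```

Whether this beats the built-in MPI_SUM would need to be measured, since MPI implementations often have optimized paths for their built-in operations.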
|
|