We will focus here on the transfer of the gradients from the 'worker' ranks to the 'server' rank.
|
|
Note that:
|
|
|
- Transfer of gradients from worker to server is done through MPI_Reduce (SUM)
|
|
|
- Transfer of updated parameters from server to workers is done through MPI_Bcast (a sketch of this exchange follows the list below)
|
|
|
- The data sent for each of these communications is about 250 000 floats, so about 1 Gb
|
|
|
- According to the [MN5 overview](https://www.bsc.es/supportkc/docs/MareNostrum5/overview), the transfer speed of a node is about 10 Gb/s
|
|
|
- We are using one socket per rank, i.e. 2 ranks per node
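A minimal sketch of this exchange (assuming C, float gradients and parameters, rank 0 acting as the 'server', and an illustrative parameter count and update rule, none of which are taken from the actual code):

```c
#include <mpi.h>

#define N_PARAMS 250000          /* parameter count from the note above */

/* Baseline update phase: reduce the gradients on the server rank, update
 * the parameters there, then broadcast them back to every rank. */
void update_phase(float *grads, float *params, int rank)
{
    /* Workers send their gradients; rank 0 receives the element-wise sum. */
    MPI_Reduce(rank == 0 ? MPI_IN_PLACE : grads, grads,
               N_PARAMS, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Hypothetical SGD-style update on the server rank. */
        for (int i = 0; i < N_PARAMS; i++)
            params[i] -= 0.01f * grads[i];
    }

    /* The server broadcasts the updated parameters back to all ranks. */
    MPI_Bcast(params, N_PARAMS, MPI_FLOAT, 0, MPI_COMM_WORLD);
}
```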
|
|
|
|
|
|
|
|
|
# Bottleneck
|
|
|
|
|
|
## Introduction
|
|
|
|
|
|
At this point, the bottleneck of this distributed version is the update phase, which we need to make efficient. An update phase is typically made of 2 or 3 steps:
|
|
|
- Sum all the distributed gradients
|
|
|
- Update the parameters using the summed gradient array
|
|
|
- If necessary, share the updated parameters with all ranks
|
|
|
|
|
|
As the transfer speed limit in and out of a node is 10 Gb/s, we need to reduce the communications going through any single node as much as possible to avoid congestion. This is, for instance, what happened in the run with 9 ranks; let's investigate it:
|
|
|
With 9 ranks, we are using 5 nodes. At the end of the forward pass, we use a Reduce collective communication to sum all the gradients on one node. This means that, with a basic reduction strategy, 4 nodes will each send 1 Gb to a single node, so node 0 will receive a total of 4 Gb, which creates congestion. With an incoming transfer speed of 10 Gb/s per node, we can expect this transfer to take 400 ms. Moreover, each node needs to send 2 Gb (2 ranks per node), so, still considering a simple reduction strategy, sending these 2 Gb takes 200 ms per node. Taking the pipelining of these transfers into account as well, we can understand why the Reduce communication took about 500 ms in the last example with 9 ranks. The same reasoning can be applied to the Broadcast communication.
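Spelling out the arithmetic above (with the 10 Gb/s per-node figure from the notes at the top of this page):

$$
t_{\mathrm{recv}} = \frac{4 \times 1\ \mathrm{Gb}}{10\ \mathrm{Gb/s}} = 400\ \mathrm{ms},
\qquad
t_{\mathrm{send}} = \frac{2\ \mathrm{Gb}}{10\ \mathrm{Gb/s}} = 200\ \mathrm{ms},
$$

which, once the pipelining of the transfers is taken into account, is consistent with the roughly 500 ms observed for the Reduce.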
|
|
|
|
|
|
## Solutions
|
|
|
|
|
|
Therefore, we need to find more efficient ways to execute the update phase. The strategies we can think about are:
|
|
|
|
|
|
With strategy 2, we can avoid transfer congestion on a single node by transferring the same amount of data on each node. This way, we also leverage the bi-directional links between nodes on MN5.
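One way strategy 2 could be realized (this is an assumption about the exact scheme, not a description of the implemented code): give each rank ownership of one chunk of the gradient and parameter arrays, reduce every chunk onto its owner, update locally, then gather the full parameter array back on every rank. Buffer names and the update rule below are illustrative.

```c
#include <mpi.h>

/* Possible sketch of strategy 2: the reduction traffic is spread evenly
 * across ranks (and therefore nodes) instead of converging on node 0. */
void update_phase_distributed(float *grads, float *params, int n_params)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n_params / size;         /* assumes n_params % size == 0 */

    /* Reduce chunk r onto rank r: each rank ends up with one summed slice. */
    for (int r = 0; r < size; r++)
        MPI_Reduce(rank == r ? MPI_IN_PLACE : grads + r * chunk,
                   grads + r * chunk, chunk, MPI_FLOAT, MPI_SUM,
                   r, MPI_COMM_WORLD);

    /* Update only the slice this rank owns (hypothetical update rule). */
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        params[i] -= 0.01f * grads[i];

    /* Rebuild the full, updated parameter array on every rank, in place. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  params, chunk, MPI_FLOAT, MPI_COMM_WORLD);
}
```

With such a scheme, each node sends and receives roughly the same volume, which matches the goal of equalizing traffic described above.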
|
|
|
For strategy 3, we can hope that using fewer MPI communications reduces the total time taken by this update phase. We can also hope that MPI optimizes its combined reduce-and-scatter communication well, which could further reduce transfer congestion.
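If strategy 3 maps to MPI's combined reduce-and-scatter call, a sketch could look like the following; MPI_Reduce_scatter_block is a standard MPI routine, while the buffer names and update rule are again only illustrative.

```c
#include <mpi.h>

/* Possible sketch of strategy 3: one combined call both sums the gradients
 * and hands each rank its own chunk of the result. */
void update_phase_reduce_scatter(const float *grads, float *summed_chunk,
                                 float *params, int n_params)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n_params / size;         /* assumes n_params % size == 0 */

    /* Sum all gradients and scatter the result: rank r receives chunk r. */
    MPI_Reduce_scatter_block(grads, summed_chunk, chunk,
                             MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Update the owned slice (hypothetical update rule). */
    for (int i = 0; i < chunk; i++)
        params[rank * chunk + i] -= 0.01f * summed_chunk[i];

    /* Rebuild the full, updated parameter array on every rank, in place. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  params, chunk, MPI_FLOAT, MPI_COMM_WORLD);
}
```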
|
|
|
|
|
|
|
|
Finally, for all these strategies, we can also think about moving from the classic MPI_SUM operation to a user-defined one, which would leverage OmpSs-2 or OpenMP to parallelize the computation.
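As an illustration of that last point, a user-defined reduction whose local summation is parallelized with OpenMP could look like the sketch below (an OmpSs-2 version would express the loop as tasks instead; the function names are made up, and the operation is only registered for MPI_FLOAT).

```c
#include <mpi.h>
#include <omp.h>

/* Combine function with the signature required by MPI_Op_create:
 * inoutvec[i] += invec[i], with the loop parallelized by OpenMP. */
static void omp_sum_float(void *invec, void *inoutvec, int *len,
                          MPI_Datatype *dtype)
{
    (void)dtype;                     /* only ever used with MPI_FLOAT here */
    float *in = (float *)invec;
    float *inout = (float *)inoutvec;

    #pragma omp parallel for
    for (int i = 0; i < *len; i++)
        inout[i] += in[i];
}

/* Usage: create the op once, pass it to MPI_Reduce instead of MPI_SUM. */
void reduce_with_custom_op(float *grads, int count, int rank)
{
    MPI_Op omp_sum;
    MPI_Op_create(omp_sum_float, /* commutative = */ 1, &omp_sum);

    MPI_Reduce(rank == 0 ? MPI_IN_PLACE : grads, grads,
               count, MPI_FLOAT, omp_sum, 0, MPI_COMM_WORLD);

    MPI_Op_free(&omp_sum);
}
```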
|
|
|
|
|
|
## Analysis and comparison
|
|
|
|
|
|
Comparison of the update phase time for each strategy, without considering task overlapping.
|
|
|
|
|
|
| Version    | Update phase time |
| ---------- | ----------------- |
| Strategy 1 | 665 ms            |
| Strategy 2 | 210-230 ms        |
| Strategy 3 |                   |
|
|