
Distributed Model · Changes

Update Distributed Model authored Aug 13, 2024 by tchatela
Distributed-Model.md
@@ -55,8 +55,8 @@ Dt = 2.5 s
![all-nine](uploads/2d471a1e2cf8dc0bf30f957a4554d9e5/all-nine.png)
- Broadcast takes about 115 ms
- Reduce takes about 500 ms
- Forward + backward pass takes about 100 ms
- Time per iteration is about 720 ms
@@ -70,4 +70,26 @@ At this point, the bottleneck of this distributed version is to have an efficien
- If necessary, share the updated parameters to all ranks
As the transfer speed limit in and out of a node is 10 Gb/s, we need to reduce the communications on any single node as much as possible to avoid congestion. This is, for example, what happened in the run with 9 ranks. Let's investigate it:
With 9 ranks, we are using 5 nodes. At the end of the backward pass, we use a Reduce collective operation to sum all the gradients on one node. With a basic reduction strategy, this means that 4 nodes will send 1 Gb to a single node. Therefore, node 0 will receive a total of 4 Gb, which creates congestion. With an incoming transfer speed of 10 Gb/s per node, we can expect the transfer to take 400 ms. Moreover, each node needs to send 2 Gb (2 ranks per node), so, still assuming a simple reduction strategy, sending those 2 Gb takes about 200 ms per node. Taking the pipelining into account as well, we can understand why the reduce operation took about 500 ms in the last example with 9 ranks. The same reasoning can be applied to the broadcast operation.
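Putting the numbers above together (4 Gb arriving at node 0, 2 Gb leaving each node, and a 10 Gb/s limit per direction and per node):

```math
t_{\text{recv}} \approx \frac{4\ \text{Gb}}{10\ \text{Gb/s}} = 400\ \text{ms}, \qquad
t_{\text{send}} \approx \frac{2\ \text{Gb}}{10\ \text{Gb/s}} = 200\ \text{ms}
```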
Therefore, we need to find more efficient ways to execute the update phase. The strategies we can consider are:
**Strategy 1 - Server node + worker nodes:**
- MPI_Reduce(MPI_SUM) to rank 0
- Update on rank 0
- MPI_Broadcast from rank 0
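A minimal sketch of strategy 1, assuming gradients and parameters live in contiguous float buffers (`grads`, `params`, `num_params` and the plain SGD update are illustrative placeholders, not the actual llm.c variables or optimizer):

```c
#include <mpi.h>
#include <stddef.h>

// Strategy 1: reduce all gradients to rank 0, update there, broadcast the parameters back.
void update_server_worker(float* params, float* grads, size_t num_params,
                          int rank, int nb_ranks, float lr) {
    // Sum the gradients of every rank into rank 0's buffer.
    if (rank == 0) {
        MPI_Reduce(MPI_IN_PLACE, grads, (int)num_params, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        // Update on rank 0 only (simple SGD on the averaged gradient, as a placeholder).
        for (size_t i = 0; i < num_params; i++) {
            params[i] -= lr * grads[i] / nb_ranks;
        }
    } else {
        MPI_Reduce(grads, NULL, (int)num_params, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    // Share the updated parameters with all ranks.
    MPI_Bcast(params, (int)num_params, MPI_FLOAT, 0, MPI_COMM_WORLD);
}
```

This is the layout whose congestion is analysed above: all gradient traffic converges on rank 0's node, and all parameter traffic leaves it again.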
**Strategy 2 - Worker nodes only, Sliced Reduce & Broadcast:**
- Slice the gradients into nb_ranks slices and apply an MPI_Reduce(MPI_SUM) to each slice, with a different root rank each time
- Update a slice of parameters on each rank
- Each rank broadcasts its own slice of updated parameters to all other ranks
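A sketch of strategy 2 under the same assumptions (slice sizes are assumed to divide evenly for brevity):

```c
#include <mpi.h>
#include <stddef.h>

// Strategy 2: sliced Reduce with a different root per slice, local update, sliced Bcast.
void update_sliced(float* params, float* grads, size_t num_params,
                   int rank, int nb_ranks, float lr) {
    size_t slice = num_params / nb_ranks;          // assume num_params divides evenly
    // Reduce each slice of the gradients onto its owner rank.
    for (int r = 0; r < nb_ranks; r++) {
        float* seg = grads + (size_t)r * slice;
        if (r == rank)
            MPI_Reduce(MPI_IN_PLACE, seg, (int)slice, MPI_FLOAT, MPI_SUM, r, MPI_COMM_WORLD);
        else
            MPI_Reduce(seg, NULL, (int)slice, MPI_FLOAT, MPI_SUM, r, MPI_COMM_WORLD);
    }
    // Each rank updates only its own slice of the parameters.
    for (size_t i = (size_t)rank * slice; i < (size_t)(rank + 1) * slice; i++) {
        params[i] -= lr * grads[i] / nb_ranks;
    }
    // Each rank broadcasts its updated slice to all other ranks.
    for (int r = 0; r < nb_ranks; r++) {
        MPI_Bcast(params + (size_t)r * slice, (int)slice, MPI_FLOAT, r, MPI_COMM_WORLD);
    }
}
```

Every node now sends and receives roughly the same volume, so no single link becomes the hot spot.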
**Strategy 3 - Worker nodes only, Reduce_scatter & update:**
- Use MPI_Reduce_scatter_block(MPI_SUM) to get the summed gradients array on all nodes
- Update all the parameters on all nodes
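A sketch of strategy 3; note that MPI_Reduce_scatter_block leaves each rank with the sum of one slice only, so if every rank must end up with the full updated parameter vector, an extra re-share step is still needed (the MPI_Allgather below is our assumption, not part of the list above):

```c
#include <mpi.h>
#include <stddef.h>

// Strategy 3: reduce-scatter the gradients, update the local slice, re-share if necessary.
void update_reduce_scatter(float* params, float* grads, float* grad_slice,
                           size_t num_params, int rank, int nb_ranks, float lr) {
    size_t slice = num_params / nb_ranks;          // assume num_params divides evenly
    // Each rank receives the fully summed gradients for its own slice.
    MPI_Reduce_scatter_block(grads, grad_slice, (int)slice, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    // Update the local slice of the parameters.
    for (size_t i = 0; i < slice; i++) {
        params[(size_t)rank * slice + i] -= lr * grad_slice[i] / nb_ranks;
    }
    // If every rank needs the full parameter vector, gather the updated slices back.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  params, (int)slice, MPI_FLOAT, MPI_COMM_WORLD);
}
```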
With strategy 2, we can avoid transfer congestion on a single node by transferring the same amount of data on each node. This way, we also leverage the bi-directional links between nodes on MN5.
For strategy 3, we can hope that, by using fewer MPI communications, we can reduce the total time taken by this update phase. We can also hope that MPI optimizes its combined reduce-and-scatter communication well, which would further reduce transfer congestion.
Finally, for all these strategies, we can also think about moving from the classic MPI_SUM operation to a user-defined one, which would leverage OmpSs-2 or OpenMP to parallelize the computation.
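As a rough illustration of that last point, a user-defined operator can replace MPI_SUM and parallelize the elementwise sum with OpenMP (a sketch only, not measured; an OmpSs-2 version would follow the same pattern with its own pragmas):

```c
#include <mpi.h>

// OpenMP-parallel elementwise sum over floats, usable wherever MPI_SUM appears above.
static void omp_sum_f32(void* invec, void* inoutvec, int* len, MPI_Datatype* dtype) {
    (void)dtype;                                   // only MPI_FLOAT is expected here
    float* in = (float*)invec;
    float* inout = (float*)inoutvec;
    #pragma omp parallel for
    for (int i = 0; i < *len; i++)
        inout[i] += in[i];
}

// Registration and use:
//   MPI_Op omp_sum;
//   MPI_Op_create(omp_sum_f32, 1, &omp_sum);      // 1 = the operation is commutative
//   ... pass omp_sum instead of MPI_SUM to MPI_Reduce / MPI_Reduce_scatter_block ...
//   MPI_Op_free(&omp_sum);
```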