Distributed Model · Changes

Update Distributed Model authored Sep 20, 2024 by tchatela
Distributed-Model.md @ 6d99a987
@@ -108,23 +108,12 @@ Comparison of the update phase, without considering task overlapping. We will me…
| Strategy 3 | Not tested | Not tested |
Note that strategy 3 is the one used in the GPU version. However, it is much simpler for us to use strategy 2 with the task-based implementation.
+ From now on, we will use strategy 2.
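The strategies themselves are not spelled out in this excerpt, so the following is only an illustrative sketch of the "multi-reduce + broadcast" scheme named by the old heading below, not necessarily strategy 2 as implemented in the repository. All identifiers (`num_params`, `grads`, `params`, `sgd_update`, `BLOCK_SIZE`) are assumptions, not names taken from llm.c: the gradient buffer is cut into blocks, each block is reduced onto a round-robin owner rank, that rank applies the optimizer step for its slice, and the updated slice is broadcast back.

```c
// Hedged sketch of a "multi-reduce + broadcast" update phase with MPI.
// Not the repository's code; all names are illustrative assumptions.
#include <mpi.h>
#include <stddef.h>

#define BLOCK_SIZE (1 << 20)   /* tunable "communication block size" (floats) */

/* hypothetical optimizer step on a contiguous slice of parameters */
static void sgd_update(float *params, const float *grads, size_t n, float lr) {
    for (size_t i = 0; i < n; i++) params[i] -= lr * grads[i];
}

void update_multireduce_broadcast(float *params, float *grads,
                                  size_t num_params, float lr,
                                  int rank, int world_size) {
    size_t num_blocks = (num_params + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (size_t b = 0; b < num_blocks; b++) {
        size_t offset = b * BLOCK_SIZE;
        size_t count  = (offset + BLOCK_SIZE <= num_params)
                            ? BLOCK_SIZE : num_params - offset;
        int root = (int)(b % (size_t)world_size);  /* round-robin block owner */

        /* reduce this block's gradients onto its owner rank */
        if (rank == root)
            MPI_Reduce(MPI_IN_PLACE, grads + offset, (int)count,
                       MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);
        else
            MPI_Reduce(grads + offset, NULL, (int)count,
                       MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);

        /* the owner updates its slice of the parameters */
        if (rank == root)
            sgd_update(params + offset, grads + offset, count, lr);

        /* broadcast the updated slice back to every rank */
        MPI_Bcast(params + offset, (int)count, MPI_FLOAT, root, MPI_COMM_WORLD);
    }
}
```

Splitting the buffer into blocks is what makes the "communication block size" mentioned further down a tunable parameter.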
- # Multi-reduce + broadcast communications
+ # Benchmarking
- Some traces, with 4 ranks, 1 rank/socket, B=4, T=1024
- ![RC_legend](uploads/74660146f17bc0d52b76d2d48dd95455/RC_legend.png)
- ![RC_implementation](uploads/12ae0ed35cb016331ce4da4a3f7d9aac/RC_implementation.png)
- ![iteration_mpi](uploads/7202d2550d704ad38596b59207ea9c92/iteration_mpi.png)
- ![iteration_mpi.code_legend](uploads/f5c599b62dfea7cfd3060f857889d9d1/iteration_mpi.code_legend.png)
- ![mpi_comms](uploads/cc578008512fec4a6c810fd4936d588f/mpi_comms.png)
- ![mpi_comms.code_legend](uploads/b2a0c1e1c7f98529751cbe14084768aa/mpi_comms.code_legend.png)
+ ## Increasing the number of ranks
Communication cost with number of ranks (1 rank/socket, B=4*worldsize, T=1024):
| Number of ranks | Communication time from last to first | Communication time from first to last |
@@ -137,8 +126,6 @@ Communication cost with number of ranks (1rank/socket, B=4*worldsize, T=1024) :
As you can see, the communication time is best with 4 ranks. This is because the sizes of the communication blocks can be adjusted, and they are currently tuned by hand for 4 ranks, which explains why 4 ranks gives the best communication time.
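As a hypothetical refinement of the sketch above (again, not code from the repository), the block size could be derived from the world size instead of being a hand-tuned constant, so that the communication granularity follows the number of ranks:

```c
#include <stddef.h>

/* Hypothetical helper: choose a block size so that each of the world_size
 * ranks owns roughly blocks_per_rank blocks of the parameter buffer.
 * Ceil-division guarantees the blocks cover all num_params parameters. */
size_t pick_block_size(size_t num_params, int world_size, int blocks_per_rank) {
    size_t num_blocks = (size_t)world_size * (size_t)blocks_per_rank;
    return (num_params + num_blocks - 1) / num_blocks;
}
```

A hand-tuned constant can still beat such a formula for the one configuration it was tuned on, which is consistent with the 4-rank results above.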
- # Increasing the number of ranks
With B=16, T=1024, 1 rank/socket
| Number of ranks | time/iteration | tokens/s | tokens/(s.cpus) |
@@ -150,7 +137,7 @@ With B=16, T=1024, 1 rank/socket
| 16 | 4084 ms | 4011 | 4.48 |
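For readability, here is how the derived columns of this row appear to be computed; this is an inference from the numbers shown (reading B=16 as the global batch size), not a formula stated in the wiki:

```math
\text{tokens/s} \approx \frac{B \cdot T}{t_{\text{iter}}} = \frac{16 \times 1024}{4.084\,\mathrm{s}} \approx 4011
\qquad
\text{tokens/(s.cpus)} = \frac{\text{tokens/s}}{\#\text{cpus}} \approx \frac{4011}{896} \approx 4.48
```

The total of roughly 896 cores is itself inferred (4011 / 4.48), which would correspond to about 56 cores per rank with one rank per socket; the per-socket core count is not given in this excerpt.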
- # Increasing the batch size
+ ## Increasing the batch size
4 ranks, 1 rank/socket, T=1024
...