# What / how to run the application
All implementations are now in the same branch 'distributed task-based'.
The implementations are:
- Sequential
- OpenMP
- OpenMP/nOS-V
- OmpSs-2
- OmpSs-2 x MPI
- Naïve OpenMP/nOS-V x MPI (using blocking communication calls)
Therefore, 'distributed task-based' is the most up-to-date branch, with the best performance for all versions. The only up-to-date README ('Run on MN5' section) is on this branch.
There are two bash scripts, one for compilation and one for execution. See ./compile -h and ./launcher -h, or the README.
These two scripts print out the commands they run, so once you know the commands you can easily bypass the scripts if needed.
# Application's structure
The code can be decomposed into two modules. To adjust the model, especially regarding the tasks, see the 'ompss-2_settings.h' file. You will also find some settings in the ompss-2_wrappers.h header.
![Repo-diag](uploads/1dfe85c11766d3a6ae1199effa0d80d3/Repo-diag.png)
# About token sequence splitting
When the token sequence is large enough, splitting it can be useful. The naive way is to take the sequence, i.e. the tokens with positions from 0 to T, and to scatter it equally among the ranks: with N ranks, rank i gets the tokens with positions from i×T/N to (i+1)×T/N. Furthermore, this allows us to decrease the memory allocated for the two activation arrays (acts and grad_acts), as their size has T as a parameter. All the layers would remain the same, but with T_rank = T_world/worldsize. The only kernels to modify are:
- The encoder forward and backward kernels, which need the true position of each token inside the full token sequence. This is very simple to implement.
- The attention forward and backward kernels, which first need their buffers to have the same size as if they were computing the whole sequence instead of a slice. These kernels are mainly composed of two nested for loops, the outer one with t from 0 to T and the inner one with t2 from 0 to t. By splitting the token sequence, you can restrict the outer loop to t from i×T/N to (i+1)×T/N, with i the rank id. However, you will need to add some communications: an MPI_Allgather on the l_qkv buffer before attention forward, and an MPI_Allreduce on dl_qkv after attention backward (a sketch follows this list). You will also have to shift some buffers like l_qkv and l_atty so that the computation reads the right elements. The MPI_Allreduce can also be replaced by several MPI_Reduce calls: with N ranks on the sequence, you get N MPI_Reduce, each with a different root rank. For the MPI_Reduce whose root rank is i, the ranks with id lower than i can be removed from the reduction.
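Below is a minimal sketch of this communication pattern, assuming a contiguous split of the sequence over the ranks and a simplified (T, 3C) layout for the l_qkv and dl_qkv buffers; the function name and buffer shapes are illustrative, not the exact ones from the repository.

```c
// Minimal sketch of the communication pattern described above, assuming a
// contiguous split of T over `nranks` MPI ranks and a simplified (T, 3C)
// layout for l_qkv / dl_qkv. Names and shapes are illustrative only.
#include <mpi.h>

void attention_comm_sketch(float *l_qkv, float *dl_qkv, int T, int C3,
                           int rank, int nranks) {
    int T_rank  = T / nranks;          // tokens owned by each rank
    int t_begin = rank * T_rank;       // first global token of this rank
    int t_end   = t_begin + T_rank;    // one past the last global token

    // Before attention_forward: every rank needs the full l_qkv buffer,
    // because the inner loop reads tokens t2 <= t, possibly owned by others.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  l_qkv, T_rank * C3, MPI_FLOAT, MPI_COMM_WORLD);

    // attention_forward: outer loop restricted to this rank's slice of the
    // sequence, using the global token positions t_begin..t_end-1.
    for (int t = t_begin; t < t_end; t++) {
        for (int t2 = 0; t2 <= t; t2++) {
            // ... compute attention for (t, t2), reading l_qkv at t2 ...
        }
    }

    // attention_backward accumulates into dl_qkv for all t2 <= t, so every
    // rank writes partial gradients for earlier slices as well.
    // After attention_backward: sum the partial gradients across ranks.
    MPI_Allreduce(MPI_IN_PLACE, dl_qkv, T * C3, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
}
```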
However, there is an issue with this method of splitting:
- From now on, let's call t the token position in the unscattered sequence.
- In an attention kernel, when computing token t, the inner loop runs from 0 to t, meaning that the higher the position of the token in the sequence, the more computation it requires.
- Because we have sliced the sequence into equal contiguous chunks, the ranks holding the end of the sequence have more work to do (a small worked estimate follows this list).
- Therefore, because of the communications, the ranks with low ids frequently wait for the ranks with higher ids (typically, for T=1024 and 56 CPUs per rank, rank 0 waits a bit more than 100 ms per attention_backward, and there are 12 attention_backward calls per backward pass).
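As a rough illustration of the imbalance (assuming the cost of token t is proportional to its t+1 inner iterations), the following sketch counts the inner-loop iterations per rank for a contiguous split:

```c
// Rough estimate of the per-rank attention work under a contiguous split,
// assuming the cost of token t is proportional to its t+1 inner iterations.
#include <stdio.h>

int main(void) {
    enum { N = 4 };                        // number of ranks
    const long T = 1024;                   // sequence length
    long work[N];
    for (long i = 0; i < N; i++) {
        work[i] = 0;
        for (long t = i * T / N; t < (i + 1) * T / N; t++)
            work[i] += t + 1;              // inner loop runs t2 = 0..t
    }
    for (long i = 0; i < N; i++)
        printf("rank %ld: %ld inner iterations\n", i, work[i]);
    return 0;
}
```

With T=1024 and 4 ranks this gives roughly 33k, 98k, 164k and 229k inner iterations, so the last rank does about 7 times the work of rank 0 (the ratio is about 2N-1).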
The first solution you might think of is to scatter the token sequence unequally, meaning that rank 0 would get a bigger slice than rank N-1. However, this is unlikely to be efficient, as attention_backward_KV cannot be parallelized over t, only over t2, which means it will not scale well when the slice size increases.
The solution I chose was to use an interleaved slicing: with t from 0 to T and N ranks, the token at position t is computed on rank t % N. Then you have to:
- Create an indirection table to retrieve the global token positions (see the sketch after this list)
- Rework the dependency system
- Use MPI_Allreduce instead of MPI_Reduce
- Rework the attention_forward and attention_backward kernels
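A minimal sketch of the interleaved mapping and its indirection table is shown below. It assumes a plain cyclic distribution (token t on rank t % N); the names and structures are illustrative, not the exact ones used in the repository.

```c
// Minimal sketch of an interleaved (cyclic) split of the token sequence,
// assuming token t is assigned to rank t % N. The indirection table maps a
// rank-local index back to the global token position, which the encoder and
// attention kernels need. Names are illustrative only.
#include <stdlib.h>

typedef struct {
    int T_rank;      // number of tokens owned by this rank
    int *global_pos; // global_pos[local] = global token position
} token_map_t;

token_map_t build_token_map(int T, int rank, int nranks) {
    token_map_t map;
    map.T_rank = T / nranks;  // assumes T divisible by nranks
    map.global_pos = malloc(map.T_rank * sizeof(int));
    for (int local = 0; local < map.T_rank; local++)
        map.global_pos[local] = local * nranks + rank;  // interleaved pattern
    return map;
}

// Usage inside a kernel: iterate over local tokens, recover the global
// position, then run the usual inner loop over t2 = 0..t:
//   for (int local = 0; local < map.T_rank; local++) {
//       int t = map.global_pos[local];
//       for (int t2 = 0; t2 <= t; t2++) { ... }
//   }
```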
In the end, I managed to get this slicing to give correct results, but one kernel (attention_backward_KV) still took more time than expected, probably because I was computing unnecessary tokens. However, when looking at a trace, all ranks showed the same computation time on the attention kernels.
# Useful information and points to be improved
- The dependencies currently set for the task-based version are as relaxed as possible; there is no further optimization to do there.
- The task-based distributed version does not send equally sized packets, as the model's memory layout is not well suited for communications and a task-based implementation.
- Because of this memory layout, the efficiency of the task dependencies depends on the size you choose for the update block (defined in gpt2_update).
- Model inference is currently not efficient at all (it was already inefficient in the original version, and the distributed strategy around it can be improved).
- For benchmarking, always disable model inference and model validation to get meaningful measurements.
- For now, on a single node, the model is more efficient with 4 ranks of 28 CPUs each than with 1 rank of 112 CPUs (if B is large enough).
- If B*T is large on a rank and you enable taskiter, expect taskiter to take a long time to compute the cyclic dependency graph (it can take many minutes).
- Currently, the model initialization is not the most straightforward, but you can leave it as it is.
- For the dataloader, my strategy is to load the tokens on one rank and to scatter them (a minimal sketch follows this list). In the GPU version, each rank loads its own tokens. I chose this scattering method because, for now, it makes it simpler to change the scattering shape.
- When B*T is large, you must use TinyStories as the dataset.
- I have not implemented mini-batches, and this may be one of the first things to do now. However, it should be very simple to implement.
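Below is a minimal sketch of the "load on one rank, then scatter" dataloader strategy, assuming every rank receives an equal share of B_rank*T int tokens; the function and parameter names are illustrative, not the repository's actual dataloader.

```c
// Minimal sketch of the dataloader strategy described above: rank 0 reads a
// batch of tokens and scatters an equal share to every rank. B_rank, T and
// the buffer names are illustrative; the real dataloader differs.
#include <mpi.h>
#include <stddef.h>

void scatter_batch(const int *full_batch,  // only meaningful on rank 0,
                                           // other ranks may pass NULL
                   int *local_batch,       // receives B_rank * T tokens
                   int B_rank, int T) {
    MPI_Scatter(full_batch, B_rank * T, MPI_INT,
                local_batch, B_rank * T, MPI_INT,
                0, MPI_COMM_WORLD);
}
```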