|
|
# What / how to run the application
|
|
|
|
|
|
All implementations are now in the same branch, 'distributed task-based'.
|
|
|
The implementations are:
|
|
|
- Sequential
|
|
|
- OpenMP
|
|
|
- OpenMP/nOS-V
|
|
|
- OmpSs-2
|
|
|
- OmpSs-2 x MPI
|
|
|
- Naïve OpenMP/nOS-V x MPI (using blocking communication calls)
|
|
|
|
|
|
Therefore, 'distributed task-based' is the most up-to-date branch, with the best performance for all versions. The updated README ('Run on MN5' section) only exists on this branch.
|
|
|
There are two bash scripts, one for compilation and one for execution. See `./compile -h` and `./launcher -h`, or the README.
|
|
|
These two scripts print out the commands they run, so once you know the commands you can easily bypass the scripts if needed.
|
|
|
|
|
|
# Application's structure
|
|
|
|
|
|
The code can be decomposed into two modules. To adjust the model, especially regarding the tasks, see the 'ompss-2_settings.h' file. You will also find some settings in the 'ompss-2_wrappers.h' header.
|
|
|
|
|
|
![Repo-diag](uploads/1dfe85c11766d3a6ae1199effa0d80d3/Repo-diag.png)
|
|
|
|
|
|
|
|
|
# About token sequence splitting
|
|
|
|
|
|
When the token sequence is large enough, splitting it can be useful. The naive way is to take the sequence, i.e. the tokens with positions from 0 to T, and to scatter it equally among the ranks: with N ranks, rank i gets the tokens from i*T/N to (i+1)*T/N. This also allows us to decrease the memory allocated for the two activation arrays (acts and grad_acts), as their size has T as a parameter. All the layers remain the same, but with T_rank = T_world / worldsize. The only kernels to modify are:
|
|
|
- The encoder forward and backward kernels, which need the true position of the token inside the token sequence. This is very simple to implement.
|
|
|
- The attention forward and backward kernels, which first need their buffers to have the same size as if they were computing the whole sequence instead of a slice. These kernels are mainly composed of two nested for loops, the outer one with t from 0 to T and the inner one with t2 from 0 to t. By splitting the token sequence, you can reduce the outer loop to t from i*T/N to (i+1)*T/N, with i the rank id. However, you will need to add some communications: an MPI_Allgather on the l_qkv buffer before attention forward, and an MPI_Allreduce on dl_qkv after attention backward (a simplified sketch of these communications follows this list). You will also have to shift some buffers, like l_qkv and l_atty, so that the computation reads the right elements. The MPI_Allreduce can also be replaced by several MPI_Reduce: with N ranks on the sequence, you get N MPI_Reduce calls, each with a different root rank, and for the MPI_Reduce rooted at rank i you can remove the ranks with id <= i from the reduction.
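
Below is a minimal sketch of these two collectives, assuming a simplified layout where each rank's token slice is one contiguous block of the buffer (which is not the case for the real model buffers, see the last section); the function name and exact buffer sizes are illustrative, not the repository code.

```c
#include <mpi.h>

/* Illustrative only: the two collectives around one layer's attention
 * kernels under the naive contiguous split. Buffer names follow the
 * description above; sizes and layout are simplifying assumptions. */
void attention_comms_sketch(float *l_qkv_local, /* this rank's slice: B*T_rank*3*C floats */
                            float *l_qkv,       /* full sequence:     B*T*3*C floats      */
                            float *dl_qkv,      /* qkv gradients:     B*T*3*C floats      */
                            int B, int T, int C, MPI_Comm comm)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);
    int T_rank  = T / nranks;           /* assumes T divisible by nranks      */
    int slice   = B * T_rank * 3 * C;   /* floats owned by each rank          */
    int t_begin = rank * T_rank;        /* outer-loop bounds of the local     */
    int t_end   = t_begin + T_rank;     /* attention kernels (not shown here) */
    (void)t_begin; (void)t_end;

    /* Forward: every rank reads keys/values of tokens it does not own, so the
     * full qkv buffer is rebuilt on every rank before attention_forward. */
    MPI_Allgather(l_qkv_local, slice, MPI_FLOAT,
                  l_qkv,       slice, MPI_FLOAT, comm);

    /* ... attention_forward and attention_backward on the local token slice ... */

    /* Backward: each rank produced partial gradients for the whole sequence,
     * so they are summed across ranks after attention_backward. */
    MPI_Allreduce(MPI_IN_PLACE, dl_qkv, B * T * 3 * C,
                  MPI_FLOAT, MPI_SUM, comm);
}
```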
|
|
|
|
|
|
However, there is an issue with this method of splitting:
|
|
|
- From now on, let's call t the token position in the unscattered sequence.
|
|
|
- In an attention kernel, when computing token t, you have an inner loop from 0 to t, meaning that the later the token is in the sequence, the more computation it requires.
|
|
|
- Because we have sliced the sequence into equal contiguous chunks, the ranks holding the end of the sequence have more work to do.
|
|
|
- Therefore, because of the communications, the ranks with low id frequently wait for the ranks with higher id (typically, for T=1024 and 56 CPUs per rank, rank 0 waits a bit more than 100 ms per attention_backward, and there are 12 attention_backward calls per backward pass). A rough estimate of this imbalance follows this list.
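
To make the imbalance concrete, here is a rough estimate derived from the loop bounds above (an illustration, not a measurement): the inner loop for token t performs about t iterations, so with N equal contiguous slices the work of rank i is roughly

$$
W_i \approx \sum_{t = iT/N}^{(i+1)T/N - 1} t \approx \frac{T^2}{2N^2}\,(2i + 1),
$$

so the last rank (i = N-1) does about (2N-1) times the work of rank 0 in the attention kernels, which is why the low ranks spend so much time waiting on the collectives.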
|
|
|
|
|
|
The first solution that comes to mind is to scatter the token sequence unequally, so that rank 0 gets a bigger slice than rank N-1. However, this is unlikely to be efficient: attention_backward_KV cannot be parallelized over t, only over t2, so it does not scale well when the slice size increases.
|
|
|
The solution I chose was to use an interleaved split: with t from 0 to T, the token at position t is computed on rank t % N (a minimal sketch of this mapping is given after the list below). You then have to:
|
|
|
- Create an indirection table to retrieve token positions
|
|
|
- Re-work the dependency system
|
|
|
- Use MPI_Allreduce instead of MPI_Reduce
|
|
|
- Rework attention_forward and attention_backward kernels
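
Here is a minimal sketch of the interleaved mapping and of the indirection table, assuming the round-robin assignment of token t to rank t % N described above; the function names are illustrative, not the repository code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Round-robin owner of token t among N ranks. */
int token_owner(int t, int nranks) { return t % nranks; }

/* Indirection table: local index k on this rank -> global token position. */
int *build_indirection(int T, int rank, int nranks, int *T_rank_out)
{
    int T_rank = T / nranks;                 /* assumes T divisible by nranks     */
    int *pos = malloc(T_rank * sizeof *pos);
    for (int k = 0; k < T_rank; k++)
        pos[k] = k * nranks + rank;          /* tokens rank, rank+N, rank+2N, ... */
    *T_rank_out = T_rank;
    return pos;
}

int main(void)
{
    int T = 16, nranks = 4, T_rank;
    for (int r = 0; r < nranks; r++) {
        int *pos = build_indirection(T, r, nranks, &T_rank);
        printf("rank %d computes tokens:", r);
        for (int k = 0; k < T_rank; k++)
            printf(" %d", pos[k]);
        printf("\n");
        free(pos);
    }
    return 0;
}
```

With this mapping every rank holds tokens spread evenly across the sequence, so the total inner-loop work per rank is roughly the same, which is what motivates the reworked dependencies and kernels listed above.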
|
|
|
|
|
|
In the end, I managed to get this slicing to give correct results, but one kernel (attention_backward_KV) still took more time than expected, probably because I was computing unnecessary tokens. However, when looking at a trace, the attention kernels had the same computation time on every rank.
|
|
|
|
|
|
# Bulk of useful information and points to be improved
|
|
|
- The dependencies currently set for the task-based version are as relaxed as possible; there is no further optimization to do there.
|
|
|
- The distributed task-based version does not send packets of equal size, as the model's memory layout is not efficient for communications and for a task-based implementation.
|
|
|
- Because of this memory layout, the efficiency of the task dependencies depends on the size you choose for the update block (defined in gpt2_update); a schematic example is given after this list.
|
|
|
- Model inference is currently not efficient at all (it was not efficient in the original version either, and the distributed strategy around it can be improved).
|
|
|
- For benchmarking, always disable model inference and model validation to get correct results
|
|
|
- For now, on one node, the model is more efficient with 4 ranks of 28 CPUs each than with 1 rank of 112 CPUs (if B is large enough).
|
|
|
- If B*T is large on a rank and you enable taskiter, expect taskiter to take a long time to compute the cyclic dependency graph (it can take many minutes).
|
|
|
- Currently, the model initialization is not the most obvious one, but you can leave it as it is.
|
|
|
- For the dataloader, my strategy is to load the tokens on one rank and to scatter them. In the GPU version, each rank loads its own tokens. I chose this scattering method because, for now, it makes it simpler to change the scattering shapes.
|
|
|
- When B*T is large, you must use tinystories as the dataset.
|
|
|
- I have not implemented mini-batches, and this may be one of the first things to do now. However, it should be very simple to do.
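
As an illustration of the update-block remark above, here is a schematic sketch of a blocked parameter update with one OmpSs-2 task per block. The plain SGD step, the UPDATE_BLOCK constant and the function name are assumptions made for the example (the real gpt2_update does more work per parameter); it only shows how the block size drives the granularity of the task dependencies.

```c
#include <stddef.h>

#define UPDATE_BLOCK 4096   /* hypothetical block size knob */

/* One task per block of parameters: smaller blocks release parameters
 * earlier for the next iteration but create more tasks, larger blocks
 * do the opposite. */
void blocked_update_sketch(float *params, const float *grads,
                           size_t num_parameters, float learning_rate)
{
    for (size_t start = 0; start < num_parameters; start += UPDATE_BLOCK) {
        size_t len = num_parameters - start;
        if (len > UPDATE_BLOCK) len = UPDATE_BLOCK;

        #pragma oss task inout(params[start;len]) in(grads[start;len])
        for (size_t i = start; i < start + len; i++)
            params[i] -= learning_rate * grads[i];
    }
    #pragma oss taskwait
}
```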