|
|
# What / how to run the application
|
|
|
|
|
|
All implementations are now in the same branch, 'distributed task-based'.
|
|
|
The implementations are:
|
|
|
- Sequential
|
|
|
- OpenMP
|
|
|
- OpenMP/nOS-V
|
|
|
- OmpSs-2
|
|
|
- OmpSs-2 x MPI
|
|
|
- Naïve OpenMP/nOS-V x MPI (using blocking communication calls)
|
|
|
|
|
|
Therefore, 'distributed task-based' is the most up-to-date branch, with the best performance for all versions. The updated README ('Run on MN5' section) only exists on this branch.
|
|
|
There are two bash scripts, one for compilation and one for execution. See `./compile -h` and `./launcher -h`, or the README.
|
|
|
These two scripts print out the commands they run, so once you know the commands you can easily bypass the scripts if needed.
|
|
|
|
|
|
# Application's structure
|
|
|
|
|
|
The code can be decomposed into two modules. To adjust the model, especially regarding the tasks, see the 'ompss-2_settings.h' file. You will also find some settings in the 'ompss-2_wrappers.h' header.
|
|
|
|
|
|
![Repo-diag](uploads/1dfe85c11766d3a6ae1199effa0d80d3/Repo-diag.png)
|
|
|
|
|
|
|
|
|
# About token sequence splitting
|
|
|
|
|
|
When the token sequence is large enough, splitting it can be useful. The naive way is to take the sequence, i.e. the tokens with positions from 0 to T, and to scatter it equally among the ranks: with N ranks, rank i gets the tokens from i*T/N to (i+1)*T/N. This also allows us to decrease the memory allocated for the two activation arrays (acts and grad_acts), as their size has T as a parameter. All the layers remain the same, but with T_rank = T_world / worldsize. The only kernels to modify are:
|
|
|
- The encoder forward and backward kernels, which need the true position of the token inside the token sequence. This is very simple to implement.
|
|
|
- The attention forward and backward kernels, which first need their buffers to have the same size as if they were computing the whole sequence instead of a slice. These kernels are mainly composed of two nested for loops, the outer one with t from 0 to T and the inner one with t2 from 0 to t. By splitting the token sequence, you can reduce the outer loop to t from i*T/N to (i+1)*T/N, with i the rank id. However, you will need to add some communications: an MPI_Allgather on the l_qkv buffer before attention forward, and an MPI_Allreduce on dl_qkv after attention backward (a simplified sketch of these communications follows this list). You will also have to shift some buffers, like l_qkv and l_atty, so that the computation reads the right elements. The MPI_Allreduce can also be replaced by several MPI_Reduce: with N ranks on the sequence, you get N MPI_Reduce calls, each with a different root rank, and for the MPI_Reduce rooted at rank i you can remove the ranks with id <= i from the reduction.
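
Below is a minimal sketch of these two collectives, assuming a simplified layout where each rank's token slice is one contiguous block of the buffer (which is not the case for the real model buffers, see the last section); the function name and exact buffer sizes are illustrative, not the repository code.

```c
#include <mpi.h>

/* Illustrative only: the two collectives around one layer's attention
 * kernels under the naive contiguous split. Buffer names follow the
 * description above; sizes and layout are simplifying assumptions. */
void attention_comms_sketch(float *l_qkv_local, /* this rank's slice: B*T_rank*3*C floats */
                            float *l_qkv,       /* full sequence:     B*T*3*C floats      */
                            float *dl_qkv,      /* qkv gradients:     B*T*3*C floats      */
                            int B, int T, int C, MPI_Comm comm)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);
    int T_rank  = T / nranks;           /* assumes T divisible by nranks      */
    int slice   = B * T_rank * 3 * C;   /* floats owned by each rank          */
    int t_begin = rank * T_rank;        /* outer-loop bounds of the local     */
    int t_end   = t_begin + T_rank;     /* attention kernels (not shown here) */
    (void)t_begin; (void)t_end;

    /* Forward: every rank reads keys/values of tokens it does not own, so the
     * full qkv buffer is rebuilt on every rank before attention_forward. */
    MPI_Allgather(l_qkv_local, slice, MPI_FLOAT,
                  l_qkv,       slice, MPI_FLOAT, comm);

    /* ... attention_forward and attention_backward on the local token slice ... */

    /* Backward: each rank produced partial gradients for the whole sequence,
     * so they are summed across ranks after attention_backward. */
    MPI_Allreduce(MPI_IN_PLACE, dl_qkv, B * T * 3 * C,
                  MPI_FLOAT, MPI_SUM, comm);
}
```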
|
|
|
|
|
|
However, there is an issue with this method of splitting:
|
|
|
- From now on, let's call t the token position in the unscattered sequence.
|
|
|
- In an attention kernel, when computing token t, you have an inner loop from 0 to t, meaning that the later the token is in the sequence, the more computation it requires.
|
|
|
- Because we have sliced the sequence into equal contiguous chunks, the ranks holding the end of the sequence have more work to do.
|
|
|
- Therefore, because of the communications, the ranks with low id frequently wait for the ranks with higher id (typically, for T=1024 and 56 CPUs per rank, rank 0 waits a bit more than 100 ms per attention_backward, and there are 12 attention_backward calls per backward pass). A rough estimate of this imbalance follows this list.
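
To make the imbalance concrete, here is a rough estimate derived from the loop bounds above (an illustration, not a measurement): the inner loop for token t performs about t iterations, so with N equal contiguous slices the work of rank i is roughly

$$
W_i \approx \sum_{t = iT/N}^{(i+1)T/N - 1} t \approx \frac{T^2}{2N^2}\,(2i + 1),
$$

so the last rank (i = N-1) does about (2N-1) times the work of rank 0 in the attention kernels, which is why the low ranks spend so much time waiting on the collectives.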
|
|
|
|
|
|
The first solution that comes to mind is to scatter the token sequence unequally, so that rank 0 gets a bigger slice than rank N-1. However, this is unlikely to be efficient: attention_backward_KV cannot be parallelized over t, only over t2, so it does not scale well when the slice size increases.
|
|
|
The solution I chose was to use an interleaved split: with t from 0 to T, the token at position t is computed on rank t % N (a minimal sketch of this mapping is given after the list below). You then have to:
|
|
|
- Create an indirection table to retrieve token positions
|
|
|
- Re-work the dependency system
|
|
|
- Use MPI_Allreduce instead of MPI_Reduce
|
|
|
- Rework attention_forward and attention_backward kernels
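
Here is a minimal sketch of the interleaved mapping and of the indirection table, assuming the round-robin assignment of token t to rank t % N described above; the function names are illustrative, not the repository code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Round-robin owner of token t among N ranks. */
int token_owner(int t, int nranks) { return t % nranks; }

/* Indirection table: local index k on this rank -> global token position. */
int *build_indirection(int T, int rank, int nranks, int *T_rank_out)
{
    int T_rank = T / nranks;                 /* assumes T divisible by nranks     */
    int *pos = malloc(T_rank * sizeof *pos);
    for (int k = 0; k < T_rank; k++)
        pos[k] = k * nranks + rank;          /* tokens rank, rank+N, rank+2N, ... */
    *T_rank_out = T_rank;
    return pos;
}

int main(void)
{
    int T = 16, nranks = 4, T_rank;
    for (int r = 0; r < nranks; r++) {
        int *pos = build_indirection(T, r, nranks, &T_rank);
        printf("rank %d computes tokens:", r);
        for (int k = 0; k < T_rank; k++)
            printf(" %d", pos[k]);
        printf("\n");
        free(pos);
    }
    return 0;
}
```

With this mapping every rank holds tokens spread evenly across the sequence, so the total inner-loop work per rank is roughly the same, which is what motivates the reworked dependencies and kernels listed above.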
|
|
|
|
|
|
In the end, I managed to get this slicing to give correct results, but one kernel (attention_backward_KV) still took more time than expected, probably because I was computing unnecessary tokens. However, when looking at a trace, the attention kernels had the same computation time on every rank.
|
|
|
|
|
|
# Bulk of useful information and points to be improved
|
|
|
- The dependencies currently set for the task-based version are as relaxed as possible; there is no further optimization to do there.
|
|
|
- The distributed task-based version does not send packets of equal size, as the model's memory layout is not efficient for communications and for a task-based implementation.
|
|
|
- Because of this memory layout, the efficiency of the task dependencies depends on the size you choose for the update block (defined in gpt2_update); a schematic example is given after this list.
|
|
|
- Model inference is currently not efficient at all (it was not efficient in the original version either, and the distributed strategy around it can be improved).
|
|
|
- For benchmarking, always disable model inference and model validation to get correct results
|
|
|
- For now, on one node, the model is more efficient with 4 ranks of 28 CPUs each than with 1 rank of 112 CPUs (if B is large enough).
|
|
|
- If B*T is large on a rank and you enable taskiter, expect taskiter to take a long time to compute the cyclic dependency graph (it can take many minutes).
|
|
|
- Currently, the model initialization is not the most obvious one, but you can leave it as it is.
|
|
|
- For the dataloader, my strategy is to load the tokens on one rank and to scatter them. In the GPU version, each rank loads its own tokens. I chose this scattering method because, for now, it makes it simpler to change the scattering shapes.
|
|
|
- When B*T is large, you must use tinystories as the dataset.
|
|
|
- I have not implemented mini-batches, and this may be one of the first things to do now. However, it should be very simple to do.
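
As an illustration of the update-block remark above, here is a schematic sketch of a blocked parameter update with one OmpSs-2 task per block. The plain SGD step, the UPDATE_BLOCK constant and the function name are assumptions made for the example (the real gpt2_update does more work per parameter); it only shows how the block size drives the granularity of the task dependencies.

```c
#include <stddef.h>

#define UPDATE_BLOCK 4096   /* hypothetical block size knob */

/* One task per block of parameters: smaller blocks release parameters
 * earlier for the next iteration but create more tasks, larger blocks
 * do the opposite. */
void blocked_update_sketch(float *params, const float *grads,
                           size_t num_parameters, float learning_rate)
{
    for (size_t start = 0; start < num_parameters; start += UPDATE_BLOCK) {
        size_t len = num_parameters - start;
        if (len > UPDATE_BLOCK) len = UPDATE_BLOCK;

        #pragma oss task inout(params[start;len]) in(grads[start;len])
        for (size_t i = start; i < start + len; i++)
            params[i] -= learning_rate * grads[i];
    }
    #pragma oss taskwait
}
```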