|
* **NH:** the Number of Heads; NH must be a divisor of C
|
|
* **hs:** the head size, equal to C / NH
|
|
|
|
|
|
The principle of multi-head attention is to divide each of Q, K and V into _heads_. As you can see above, K, Q and V are divided into NH parts, making sub-arrays of size _hs_. We then apply the attention mechanism we saw earlier to each head separately. The NH heads never interact with one another at any point in the attention computation (which makes parallel computation possible). Since we compute one attention matrix per head, the multi-head attention tensor has shape (NH, T, T).
|
|
|
|
|
|
### From attention matrix to the output of the attention layer
|
|
|
|
|
|
|
|
|
|
## Backward propagation
|
|
|
|
|
|
For backward propagation, we use the _cross-entropy_ error function.
|
|
|
|
|
|
![63d2878f8b9a1378924ba771_Formula-9](uploads/7e9876747d85bdb9b53acba639a7186b/63d2878f8b9a1378924ba771_Formula-9.png)
|
|
|
|
|
|
|
|
|
|
## Sequential
|
|
|
|
|
|
Performance of the sequential model
|
|
Concerning the sequential version, we can see that, as expected, every iteration (model forward + model backward + weights update) takes roughly the same amount of time. The sequential version has been run 4 times, and the results displayed are the mean of these 4 runs.
|
|
|
|
|
|
|
|
![seq2](uploads/2aac9034c025ab01d24e32f162b6543c/seq2.png)
|
|
|
|
|
|
|
|
![seq](uploads/2562f07ab3d29743f8ef78b2d71d8515/seq.png)
|
|
|
|
|
|
|
|
The time taken for each iteration is approximately 35,285 ms. Thus, running the model over 40 iterations gives a total runtime of 24 min 06 s.
|
|
|
|
|
|
## OpenMP
|
|
|
|
|
|
Performance of the model with OpenMP
|
|
For the OpenMP version, the model has been run 40 times, with 40 iterations each, using 112 CPUs. The average runtime per iteration is 1,430 ms, and the total runtime is 58 seconds.
|
|
|
|
|
|
|
|
Speedup = 24.1 / 0.97 ≈ 24.9\
|
|
|
|
Efficiency = 0.22
|
|
|
|
|
|
|
|
![openmp2](uploads/aa9985dcfd64e916ad808b7845b42024/openmp2.png)
|
|
|
|
|
|
|
|
![openmp1](uploads/390fb74ac96bed684a40eef47f49672f/openmp1.png)
|
|
|
|
|
|
## OpenMP/n-OS-V
|
|
|
|
|
|
Performance of the model with OpenMP/n-OS-V
|
For the OpenMP/n-OS-V version, the model has been run 40 times, with 40 iterations each, using 112 CPUs. The average runtime per iteration is 1,616 ms, and the total runtime is 66 seconds.
|
|
|
|
|
|
|
Speedup = 24.1 / 1.10 ≈ 21.9\
|
|
|
|
Efficiency = 0.19
|
|
|
|
|
|
|
|
What is unexpected is that the OpenMP/n-OS-V version is slower than the OpenMP version. For now, I do not know whether this is due to a configuration error in the OpenMP/n-OS-V setup.
|
|
|
|
|
|
|
|
![openmpv1](uploads/07cb4e4d83ce40cb9b7b8804d64f0e2f/openmpv1.png)
|
|
|
|
|
|
|
|
![openmpv2](uploads/c01dc0a441e6854fdc4319ec2861be08/openmpv2.png)
|
|
|
|
|
|
|
|
# Next steps
|
|
|
|
|
|
|
|
Here are the next steps I plan to work on:
|
|
|
|
|
|
|
|
* Analyze the Paraver trace obtained for the OpenMP version
|
|
|
|
* Configure OVNI for the OpenMP/n-OS-V version to get a trace
|
|
|
|
* Investigate why the first iteration is faster for the sequential version and slower for the OpenMP versions
|
|
|
|
* Continue improving my tooling to build a testing pipeline that directly produces all performance numbers and graphs for a given version of the program
|
|
|
|
* Read related work on GPT-2 parallelization
|
|
|
|
* Investigate why the tokenization of the TinyStories dataset crashes
|
|
|