|
|
|
|
|
### Variables for backward propagation
|
|
|
|
|
|
|
In order to calculate the gradients for backpropagation, we have to keep a history of all the inputs/outputs of each layer. Besides, for layernorm and attention, we also need to keep a history of some vectors that are computed inside these layers.
|
|
|
|
|
|
|
|
All of these vectors are stored in the field _acts_ of the GPT2 model's data structure.
|
|
|
|
|
|
- **inputs** : Contains all the tokens for the current batch (B, T)
|
|
- **encoded** : Output of the positional_encoding layer (B, T, C)
|
|
- **fcproj** : Output of the last linear layer in the transformer block (L, B, T, C)
|
|
- **residual3** : Output of the second residual layer in the transformer block (L, B, T, C)
|
|
|
|
|
|
**_NB : As mentioned in the previous part, the dimension L is here because we have L transformer layers. When passing these activations to the layer functions, this extra dimension is removed, since we are then working inside a single layer._**
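As an illustration, here is a minimal sketch of what such an activation container could look like in C. This is only a sketch: the struct name, the exact set of fields and their types are assumptions; only the fields and shapes named in this section come from the text above.

```c
/* Hypothetical activation-history container (shapes given as comments).
 * All activation tensors are assumed to be stored as flat float arrays. */
typedef struct {
    int   *inputs;     /* (B, T)           tokens of the current batch           */
    float *encoded;    /* (B, T, C)        output of the positional encoding     */
    float *ln_mean;    /* (L, B, T)        layernorm means kept for backward     */
    float *ln_rstd;    /* (L, B, T)        reciprocal std devs kept for backward */
    float *att;        /* (L, B, NH, T, T) attention matrices                    */
    float *fcproj;     /* (L, B, T, C)     output of the last linear layer       */
    float *residual3;  /* (L, B, T, C)     output of the second residual layer   */
    /* ... one pointer per activation needed by the backward pass ...            */
} ActivationTensors;
```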
|
|
|
|
|
|
![GPT2-inputs](uploads/4f72849b1637d91dcbe216cc24e417d1/GPT2-inputs.png)
|
|
|
|
|
|
|
|
|
|
Here is the order in which the functions are called in the model; their descriptions are given in the chapters below:
|
|
|
|
|
|
1. **function_name** _input_shape -> output_shape_ // _parameter1 parameter2..._
|
|
2. **encoder_forward** _(B,T) -> (B,T,C)_ // _wte wpe_
|
|
3. transformer layer, repeated L times, composed of:
|
|
1. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln1w ln1b_
|
|
|
|
|
|
This layer will depend on two arrays of parameters:
|
|
|
|
|
|
+ **wte (V,C)** : To each token value, these weights assign a channel of float values. This channel is retrieved at wte\[token_value\]
|
|
+ **wpe (T,C)** : To each token position in the sequence (the array of size T), these weights assign a channel. It means that for a token at position t in the sequence, an array of size C will be retrieved at wpe\[t\]
|
|
|
|
|
|
**Conclusion** : For one token, the generated array (of size C) is calculated as wte\[token_value\] + wpe\[t\]
|
|
|
|
|
|
_Reminder : a channel is an array of size C (768) which represents a token._
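A sketch of what this encoding step boils down to is shown below (a simplified version, assuming row-major flat arrays; the real code may differ in layout and signature):

```c
#include <stddef.h>

/* out: (B,T,C), inp: (B,T) token ids, wte: (V,C), wpe: (T,C) */
void encoder_forward(float *out, const int *inp,
                     const float *wte, const float *wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float *out_bt = out + ((size_t)b * T + t) * C;
            const float *wte_tok = wte + (size_t)inp[b * T + t] * C; /* channel of the token value */
            const float *wpe_t   = wpe + (size_t)t * C;              /* channel of the position    */
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_tok[c] + wpe_t[c];
            }
        }
    }
}
```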
|
|
|
|
|
|
### Normalization
|
|
|
|
|
|
|
|
|
|
First, this layer performs a basic normalization over each channel. Then, it scales and shifts each normalized value using the ln1w weights and ln1b bias (both of shape C): each value of a normalized channel is multiplied by its corresponding weight from ln1w, and the corresponding bias from ln1b is added.
|
|
|
|
|
|
|
|
This layer also keeps a history of the mean and reciprocal standard deviation for the backward pass.
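Below is a simplified sketch of this layer (the epsilon value and the exact signature are assumptions), showing the per-channel normalization, the scale/shift with the weights and bias, and the mean/rstd history kept for the backward pass:

```c
#include <math.h>
#include <stddef.h>

/* inp/out: (B,T,C), weight/bias: (C), mean/rstd: (B,T) history buffers */
void layernorm_forward(float *out, float *mean, float *rstd,
                       const float *inp, const float *weight, const float *bias,
                       int B, int T, int C) {
    const float eps = 1e-5f;                       /* assumed epsilon */
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float *x = inp + ((size_t)b * T + t) * C;
            float *o = out + ((size_t)b * T + t) * C;
            float m = 0.0f;                        /* mean of the channel */
            for (int c = 0; c < C; c++) m += x[c];
            m /= C;
            float v = 0.0f;                        /* variance of the channel */
            for (int c = 0; c < C; c++) { float d = x[c] - m; v += d * d; }
            v /= C;
            float s = 1.0f / sqrtf(v + eps);       /* reciprocal standard deviation */
            for (int c = 0; c < C; c++) {
                o[c] = (x[c] - m) * s * weight[c] + bias[c];  /* normalize, scale, shift */
            }
            mean[b * T + t] = m;                   /* history for the backward pass */
            rstd[b * T + t] = s;
        }
    }
}
```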
|
|
|
|
|
|
### Matmul
|
|
|
|
|
|
This operation is used to compute every linear layer. Its inputs are:
|
|
|
|
|
|
- A matrix of shape (B,T,k\*C), where k is a positive integer
|
|
- A matrix of weights of shape (k\*C, OC), shared across the batch
|
|
- A bias vector of shape (OC)
|
|
|
|
|
|
The output will be a matrix of shape (B,T,OC). _OC_ stands for _Output Channels_.
|
|
|
|
|
|
Basically, if the inputs are called A, W and B, the output will be the result of _A.W + B_.
|
|
|
|
|
|
In the code, a tiled matmul has been implemented, using FMA instructions and OpenMP to improve performance.
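For reference, here is a naive (untiled) sketch of this operation with a simple OpenMP pragma. The function name is illustrative, the weight layout is assumed to be (k\*C, OC) as described above, and the actual tiled/FMA implementation is more involved:

```c
#include <stddef.h>

/* Naive linear layer: out = inp . weight + bias.
 * inp: (B,T,IC), weight: (IC,OC), bias: (OC) or NULL, out: (B,T,OC). */
void matmul_forward_naive(float *out, const float *inp,
                          const float *weight, const float *bias,
                          int B, int T, int IC, int OC) {
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float *x = inp + ((size_t)b * T + t) * IC;
            float *o = out + ((size_t)b * T + t) * OC;
            for (int oc = 0; oc < OC; oc++) {
                float acc = (bias != NULL) ? bias[oc] : 0.0f;
                for (int ic = 0; ic < IC; ic++) {
                    acc += x[ic] * weight[(size_t)ic * OC + oc];  /* weight laid out (IC, OC) */
                }
                o[oc] = acc;
            }
        }
    }
}
```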
|
|
- **Keys** _K_ of shape (B,T,C)
|
|
- **Values** _V_ of shape (B,T,C)
|
|
|
|
|
|
To explain what these 3 arrays are, we will take a web search as an example. In this example, one query vector would be what you typed in your search bar, one key vector would be the labels and data of one website, and one value vector would be the link to a website. Therefore, calculating the dot product between your query and the key vectors of all the websites, and then normalizing the result, gives for each website the probability that it matches your request (this matrix is called the attention matrix). Multiplying these probabilities by the value vectors of the corresponding websites then gives you the links with a notion of relevancy.
|
|
|
|
To sum up, you have:

1\. normalize(Query.Key) = attention matrix

2\. attention matrix . Values = output of the attention layer
|
|
|
|
|
|
|
|
In our case, this allows the model to understand the relation between the words of a sentence. Let's apply attention to the sentence "I like this house". Below is the attention matrix that we could obtain with this sentence.
|
|
|
|
:warning: The attention matrix is the matrix obtained just before the dot product with the value vectors.
|
|
|
|
|
|
The Query, Key and Value vectors are obtained by the operation _token-array . qkvw-parameters-array + qkvb-parameters-array_.
|
|
Be aware that they have already been computed when entering _attention_forward_: the two linear layers shown in the GPT-2 model are actually computed outside the _attention_forward_ function, using the _matmul_forward_ function.
|
|
|
|
|
|
![Multi-headAttention4](uploads/261d802297d30ba7534215e617e8a589/Multi-headAttention4.png)
|
|
|
|
|
|
|
|
|
|
### Masked attention
|
|
|
|
|
|
Masked attention is very simple. During the training phase, as you can see above, the model has access to words _that have not been produced yet_. This means that it could compute the attention of "like" with "house", even though "house" should be absent from the token sequence at the time we compute the attention for "like". In short, a token may only compute attention between itself and the tokens before it.
|
|
|
|
Therefore, we need to apply **masked attention**, which consists in setting the attention of future words to zero, as can be seen just below.
|
|
|
|
|
|
![Multi-head_attention](uploads/a7165d88c6376c68e4584e26f0a248e6/Multi-head_attention.png)
|
|
|
|
|
|
In the model we use, the softmax normalization (which makes each row sum to 1) is applied only after the masking.
|
|
|
|
|
|
### Multi-head masked attention
|
|
|
|
|
|
|
|
|
|
![Multi-headAttention3](uploads/9a3999a8834f4e0f68aee23f1f648c88/Multi-headAttention3.png)
|
|
|
|
|
|
The question is: what is the shape of this 3D matrix? To answer this, we need to go a bit further into the details of multi-head attention.
|
|
|
|
|
|
![Multi-headAttention5](uploads/b436daa2df0ab3fd691ce0f28031a96b/Multi-headAttention5.png)
|
|
|
|
|
|
* **NH :** the Number of Heads; NH must be a divisor of C
|
|
* **hs :** the head size, equal to C / NH
|
|
|
|
|
|
The principle of multi-head attention is to divide each of Q, K, V into _heads_. As you can see above, K, Q and V are divided into NH parts, thus making sub-arrays of size _hs_. Then, we apply the attention mechanism we have seen earlier to each head separately. The NH heads never interact with one another during the attention computation (which makes parallel computation possible). Now, as we are computing a single attention value per head, the shape of the multi-head attention matrix will be (NH, T, T).
|
|
|
|
|
|
### From attention matrix to the output of the attention layer
|
|
|
|
|
|
|
|
|
|
![Multi-headAttention11](uploads/cab96fcaece49c827f700814ce7300af/Multi-headAttention11.png)
|
|
|
|
|
|
To do this, we still need to take each head separately. We then calculate the product between one head of the multi-head attention matrix and the matching value sub-array of that head. As we are multiplying a value of size _1_ (the attention value for one head) with an array of size _hs_ (the value sub-array for one head), the output array will be of size _hs_. Since we do this for _NH_ heads, the global matrix at the end will be of size (NH\*hs, T, T) = (C, T, T). Moreover, we need to sum over each row, making an output matrix of shape (C, T).
|
|
|
|
|
|
Also, remember that we are processing batches, so this process applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we forget the representation we used for convenience, the actual output shape is (B, T, C) (we have just swapped two axes).
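Putting the pieces together, here is a simplified sketch of the whole masked multi-head attention forward pass (causal masking, per-head softmax, weighted sum with the value vectors). It assumes a fused QKV input of shape (B, T, 3C) produced by the preceding matmul, includes the usual 1/sqrt(hs) scaling factor, and leaves out the buffer layout details of the real code:

```c
#include <math.h>
#include <stddef.h>

/* qkv: (B,T,3C) packed as [Q | K | V] per token, out: (B,T,C),
 * att: (B,NH,T,T) attention matrix kept for the backward pass, hs = C / NH. */
void attention_forward_sketch(float *out, float *att, const float *qkv,
                              int B, int T, int C, int NH) {
    int hs = C / NH;
    float scale = 1.0f / sqrtf((float)hs);
    for (int b = 0; b < B; b++) {
        for (int h = 0; h < NH; h++) {
            for (int t = 0; t < T; t++) {
                const float *q = qkv + ((size_t)b * T + t) * 3 * C + h * hs;
                float *att_row = att + (((size_t)b * NH + h) * T + t) * T;
                /* 1) masked, scaled dot products: only positions t2 <= t */
                float maxval = -1e30f;
                for (int t2 = 0; t2 <= t; t2++) {
                    const float *k = qkv + ((size_t)b * T + t2) * 3 * C + C + h * hs;
                    float dot = 0.0f;
                    for (int i = 0; i < hs; i++) dot += q[i] * k[i];
                    dot *= scale;
                    att_row[t2] = dot;
                    if (dot > maxval) maxval = dot;
                }
                /* 2) softmax over the unmasked part, zeros for future tokens */
                float sum = 0.0f;
                for (int t2 = 0; t2 <= t; t2++) {
                    att_row[t2] = expf(att_row[t2] - maxval);
                    sum += att_row[t2];
                }
                for (int t2 = 0; t2 <= t; t2++) att_row[t2] /= sum;
                for (int t2 = t + 1; t2 < T; t2++) att_row[t2] = 0.0f;
                /* 3) weighted sum of the value sub-arrays for this head */
                float *o = out + ((size_t)b * T + t) * C + h * hs;
                for (int i = 0; i < hs; i++) o[i] = 0.0f;
                for (int t2 = 0; t2 <= t; t2++) {
                    const float *v = qkv + ((size_t)b * T + t2) * 3 * C + 2 * C + h * hs;
                    for (int i = 0; i < hs; i++) o[i] += att_row[t2] * v[i];
                }
            }
        }
    }
}
```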
|
|
|
|
|
|
|
|
|
|
### Softmax layer
|
|
|
|
|
|
For this layer, input and output are of shape (B,T,Vp).
|
|
|
|
|
|
This layer is the last one of the model. Its purpose is to output understandable data. By applying a softmax function, we end up with an output of shape (B,T,Vp). Each token is given a normalized array of size Vp, whose values describe the probability of the token being a specific sequence of characters from the vocabulary. Note that any probability whose index is V or greater will be 0, as the vocabulary size is padded purely for data alignment and not for any logical purpose.
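Below is a simplified sketch of this layer, showing how the softmax is computed over the first V entries only, while the padded entries up to Vp are forced to zero (subtracting the maximum is the usual trick for numerical stability):

```c
#include <math.h>
#include <stddef.h>

/* logits: (B,T,Vp) -> probs: (B,T,Vp); only the first V entries are real tokens */
void softmax_forward_sketch(float *probs, const float *logits,
                            int B, int T, int V, int Vp) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float *x = logits + ((size_t)b * T + t) * Vp;
            float *p = probs + ((size_t)b * T + t) * Vp;
            float maxval = -1e30f;                 /* subtract the max for stability */
            for (int i = 0; i < V; i++) if (x[i] > maxval) maxval = x[i];
            float sum = 0.0f;
            for (int i = 0; i < V; i++) { p[i] = expf(x[i] - maxval); sum += p[i]; }
            for (int i = 0; i < V; i++) p[i] /= sum;
            for (int i = V; i < Vp; i++) p[i] = 0.0f;  /* padded vocabulary entries */
        }
    }
}
```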
|
|
|
|
|
|
## Backward propagation
|
|
|
|
|
|
|
|
|
|
![63d2878f8b9a1378924ba771_Formula-9](uploads/7e9876747d85bdb9b53acba639a7186b/63d2878f8b9a1378924ba771_Formula-9.png)
|
|
|
|
|
|
Therefore, we compare the vector of probabilities output by the softmax layer with the vector of known true probabilities (a one-hot vector: all zeros with a single 1). Then, by taking the derivative of the crossentropy function, we obtain the gradients used to shift the model parameters. For more information about how backpropagation works, see https://en.wikipedia.org/wiki/Backpropagation .
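As a sketch of this step: for a single position the cross-entropy loss is -log(probs\[target\]), and when it is combined with the softmax, the gradient with respect to the logits takes the well-known form probs - one_hot(target). The function below is only illustrative; a real implementation would typically also scale the gradient by the mean-loss factor over the batch.

```c
#include <math.h>

/* Illustrative loss/gradient for one (b,t) position.
 * probs: softmax output of size Vp, target: index of the true token. */
float crossentropy_and_grad(float *dlogits, const float *probs,
                            int target, int Vp) {
    float loss = -logf(probs[target]);     /* cross-entropy with a one-hot target */
    for (int i = 0; i < Vp; i++) {
        float indicator = (i == target) ? 1.0f : 0.0f;
        /* padded entries have probs[i] == 0, so their gradient stays 0 */
        dlogits[i] = probs[i] - indicator;  /* softmax + cross-entropy gradient */
    }
    return loss;
}
```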
|
|
|
|
|
|
The optimization function used for model update is [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html), with parameters:
|
|
|
|
|
|
|
|
|
|
## Sequential
|
|
|
|
|
|
Concerning the sequential version, we can see, as expected, that every iteration (model forward + model backward + weights update) takes the same amount of time. The sequential version has been run 4 times, and the results displayed are the mean of these 4 runs.
|
|
|
|
|
|
![seq2](uploads/2aac9034c025ab01d24e32f162b6543c/seq2.png)
|
|
|
|
|
|
![seq](uploads/2562f07ab3d29743f8ef78b2d71d8515/seq.png)
|
|
|
|
|
|
The time taken by each iteration is approximately 35 s. Thus, by running the model over 40 iterations, we get a total runtime of 24 min 06 s.
|
|
|
|
|
|
## OpenMP
|
|
|
|
|
|
For the OpenMP version, the model has been run 40 times, with 40 iterations each, using 112 CPUs. The average runtime per iteration is 1430 ms, and the total runtime is 58 seconds.
|
|
|
|
|
|
Speedup = 24.1 / 0.98 = 24.7\
Efficiency = 0.22
|
|
|
|
|
|
## OpenMP/n-OS-V
|
|
|
|
|
|
For the OpenMP/nOS-V version, the model has been run 40 times, with 40 iterations each, using 112 CPUs. The average runtime per iteration is 1616 ms, and the total runtime is 66 seconds.
|
|
|
|
|
|
Speedup = 24.1 / 1.10 = 21.8\
Efficiency = 0.19
|
|
|
|
|
|
What is unexpected is that the OpenMP/nOS-V version is slower than the OpenMP version. For now, I don't know whether this is due to a configuration error in OpenMP/nOS-V or not.
|
|
|
|
|
|
![openmpv1](uploads/07cb4e4d83ce40cb9b7b8804d64f0e2f/openmpv1.png)
|
|
|
|
|
|
* Continue to improve my tooling in order to get a testing pipeline that directly gives me all the performance numbers and graphs for a given version of the program.
|
|
* Read related work on GPT-2 parallelization
|
|
* Investigate why the tokenization of the tinystories dataset crashes
|
|
|
|
|
|
|
|
These steps have been reported in the [issues section](https://gitlab.bsc.es/tchatela/llm.c-gpt2/-/issues).
|
|
|