This wiki's goal is to link the GPT-2 model with its implementation in C provided by [this GitHub repository](https://github.com/karpathy/llm.c). It will also describe the different parallelization strategies that were tried, and how these strategies impact the performance of the model.

# Description of the model
The Generative Pre-trained Transformer 2 (GPT-2) is a Large Language Model (LLM).



However, the model implemented in [the GitHub repository](https://github.com/karpathy/llm.c) does not strictly implement this pipeline, but a slightly different version. Here are the differences:

+ There are no dropout layers

+ The second residual forward is connected to the output of the multi-head masked attention, and not to the output of the normalization layer (illustrated by the sketch below)
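
To make the second difference more concrete, here is a minimal C sketch of the two residual additions inside one transformer layer. It is an illustration, not a copy of the llm.c code; the names `layer_input`, `attention_output` and `mlp_output` are made up for this example, and buffers are treated as flat float arrays of size `n`.

```c
#include <stddef.h>

// Generic residual addition: out = skip + branch (element-wise).
static void residual_add(float* out, const float* skip, const float* branch, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = skip[i] + branch[i];
    }
}

// Inside one transformer layer:
//   residual1 = layer_input + attention_output   // first residual
//   residual2 = residual1   + mlp_output         // second residual
// The skip input of the second addition already contains the attention output;
// it is not the output of the normalization layer that feeds the MLP.
void layer_residuals(float* residual1, float* residual2,
                     const float* layer_input,
                     const float* attention_output,
                     const float* mlp_output,
                     size_t n) {
    residual_add(residual1, layer_input, attention_output, n);
    residual_add(residual2, residual1, mlp_output, n);
}
```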
Therefore, here is a rectified sketch of the model implemented:

*(sketch of the rectified GPT-2 model)*
This part is dedicated to facilitating the comprehension of the code given in the reference [GitHub repository](https://github.com/karpathy/llm.c).

## Dictionary

### Global values

- **V** : Vocabulary size (50,257)

- **Vp** : Padded vocabulary size (50,304), padded for data alignment

- **C** : Number of channels (768)

- **T** : Current token sequence length (64)

- **maxT** : Maximum token sequence length (1024)
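
To make these values easier to relate to the code, here is a hedged sketch of how they might be grouped in a C struct, together with a padding rule that turns V into Vp. The struct name `ModelGlobals` and the round-up-to-a-multiple-of-128 rule are assumptions for illustration; only the numbers come from the list above.

```c
#include <stdio.h>

// Illustrative grouping of the global values listed above (field names are
// assumptions, not necessarily the ones used in llm.c).
typedef struct {
    int V;     // vocabulary size (50,257)
    int Vp;    // padded vocabulary size (50,304), padded for data alignment
    int C;     // number of channels (768)
    int T;     // current token sequence length (64)
    int maxT;  // maximum token sequence length (1024)
} ModelGlobals;

// Round the vocabulary size up to a multiple of `align`; with align = 128
// this reproduces 50257 -> 50304.
static int pad_vocab(int V, int align) {
    return ((V + align - 1) / align) * align;
}

int main(void) {
    ModelGlobals g = { .V = 50257, .C = 768, .T = 64, .maxT = 1024 };
    g.Vp = pad_vocab(g.V, 128);
    printf("V = %d, Vp = %d\n", g.V, g.Vp);  // prints V = 50257, Vp = 50304
    return 0;
}
```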
### Model parameters
- **wte** : Weights for token embedding, shape (V, C)

- **wpe** : Weights for positional embedding, shape (maxT, C)

- **ln1w** : Weights for the first normalization layer in a transformer layer, shape (L, C)

- **ln1b** : Bias for the first normalization layer in a transformer layer, shape (L, C)

- ...

- **lnfb** : Bias for the last normalization layer, shape (C)

- **wte** : Weights for the last linear layer, shape (V, C)
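
As an illustration of how wte and wpe are consumed, here is a hedged sketch of the embedding step: each output position gets the wte row of its token plus the wpe row of its position. The function name `encoder_sketch` and the exact loop structure are assumptions, not a quotation of the repository's code.

```c
// Hedged sketch of the embedding step using wte (V, C) and wpe (maxT, C):
// out[b, t, :] = wte[token_id, :] + wpe[t, :]
// Buffers are flattened row-major, as in the C implementation.
void encoder_sketch(float* out,            // (B, T, C)
                    const int* tokens,     // (B, T) token ids in [0, V)
                    const float* wte,      // (V, C) token embedding weights
                    const float* wpe,      // (maxT, C) positional embedding weights
                    int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;
            const float* wte_row = wte + tokens[b * T + t] * C;
            const float* wpe_row = wpe + t * C;
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_row[c] + wpe_row[c];
            }
        }
    }
}
```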
**_NB: for all the parameters whose shape's first dimension is L, mind that this additional dimension of size L is there because these parameters are used inside the transformer layers, which are repeated L times. This dimension is therefore only here for storage. In the next parts, when talking about these parameters, I will not consider this dimension, as it is not useful to do so. You can also notice that this is why the position of each matrix of parameters is recalculated for every transformer layer (see the sketch below)._**
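
Here is a hedged sketch of that position recalculation: a parameter stored with a leading L dimension is one contiguous buffer, and layer `l`'s slice starts at offset `l` times the size of one slice. The helper name `layer_slice` is made up for this example; the repository performs the equivalent pointer offsets inline.

```c
#include <stddef.h>

// Hedged sketch: parameters stored with a leading dimension of size L form a
// single contiguous buffer; layer l's slice starts at l * (size of one slice).
// For ln1w and ln1b of shape (L, C), one slice is C floats.
static const float* layer_slice(const float* param_all, int l, size_t slice_size) {
    return param_all + (size_t)l * slice_size;
}

// Usage inside the loop over the L transformer layers:
//   const float* l_ln1w = layer_slice(ln1w, l, C);  // this layer's ln1w, shape (C)
//   const float* l_ln1b = layer_slice(ln1b, l, C);  // this layer's ln1b, shape (C)
// This is the "position recalculation" mentioned in the note above.
```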
### Variables for backward propagation