This wiki's goal is to link the GPT-2 model with its implementation in C provided by [this GitHub repository](https://github.com/karpathy/llm.c). It also describes the different parallelization strategies that were tried, and how these strategies impact the performance of the model.

# Description of the model
The Generative Pre-trained Transformer 2 (GPT-2) is a Large Language Model (LLM) built from a stack of Transformer decoder blocks, as sketched below.

![Basic structure of GPT-2](https://www.researchgate.net/publication/373352176/figure/fig1/AS:11431281202501967@1698856108167/GPT-2-model-architecture-The-GPT-2-model-contains-N-Transformer-decoder-blocks-as-shown.ppm)

However, the model implemented in [the GitHub repository](https://github.com/karpathy/llm.c) does not strictly follow this structure, but a slightly different version. Here are the differences:

+ There are no dropout layers
+ The second residual forward links to the output of the multi-head masked attention, and not to the output of the normalization layer (see the code sketch after this list)

Therefore, here is a rectified sketch of the model implemented: *(figure: adapted structure of GPT-2)*

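To make the second difference concrete, below is a minimal C sketch of how the residual connections can be wired around the attention block in this layout. It is only an illustration: the helper names (`residual_forward`, `layernorm_forward`, `attention_forward`, `feed_forward`) are chosen in the spirit of `train_gpt2.c` but are not copied from the repository.

```c
#include <stddef.h>

// Residual (skip) connection: out[i] = inp1[i] + inp2[i].
void residual_forward(float *out, const float *inp1, const float *inp2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = inp1[i] + inp2[i];
    }
}

// Wiring of one transformer layer (illustrative call order, shown as comments):
//
//   layernorm_forward(ln1_out, residual, ...);          // first normalization
//   attention_forward(att_out, ln1_out, ...);           // masked multi-head attention (+ projection)
//   residual_forward(residual2, residual, att_out, n);  // first residual add
//
//   layernorm_forward(ln2_out, residual2, ...);         // second normalization
//   feed_forward(ff_out, ln2_out, ...);                 // MLP block
//   residual_forward(residual3, residual2, ff_out, n);  // second residual add: the skip branch
//                                                       // starts from residual2 (the attention
//                                                       // output), not from ln2_out

int main(void) {
    float a[3] = {1, 2, 3}, b[3] = {10, 20, 30}, out[3];
    residual_forward(out, a, b, 3);  // out = {11, 22, 33}
    return 0;
}
```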
## Model's size
For GPT-2, the data goes through a variable number _L_ of transformer layers, which implies a variable (and potentially large) number of parameters. Therefore, _L_ is one of the two parameters that can be tweaked to trade off runtime against output quality. The other parameter is the number of channels _C_, which will be explained later.

| Name | Model's size (number of parameters) | _L_ | _C_ |
|------|-------------------------------------|-----|-----|
| gpt2 | 124 439 808 | 12 | 768 |
| gpt2-medium | 354 823 168 | 24 | 1024 |
| gpt2-large | 774 030 080 | 36 | 1280 |
| gpt2-xl | 1 557 611 200 | 48 | 1600 |

Mind that the values found here differ from the ones you will find in other sources, as the model structure has been slightly changed. For our application, we will use the first model size, "gpt2", from the list above.

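As a sanity check on the first row of this table, the parameter count of the "gpt2" configuration can be recomputed from _V_, _maxT_, _L_ and _C_ alone. The sketch below is an assumption about the bookkeeping (token and positional embeddings, 12C² + 13C parameters per transformer layer, a final layer norm, and a last linear layer that reuses the token-embedding matrix); compiled and run, it prints 124439808.

```c
#include <stdio.h>

// Count the parameters of the structure described above.
// Illustrative helper, not taken from llm.c.
long long gpt2_num_params(long long V, long long maxT, long long L, long long C) {
    long long emb   = V * C + maxT * C;  // token embedding (wte) + positional embedding (wpe)
    long long layer = 2 * C              // first layer norm: weights + bias
                    + 3 * C * C + 3 * C  // QKV projection: weights + bias
                    + C * C + C          // attention output projection: weights + bias
                    + 2 * C              // second layer norm: weights + bias
                    + 4 * C * C + 4 * C  // MLP up-projection (hidden size 4C): weights + bias
                    + 4 * C * C + C;     // MLP down-projection: weights + bias
    long long final_ln = 2 * C;          // last layer norm: weights + bias
    return emb + L * layer + final_ln;   // the last linear layer reuses wte, so it adds nothing
}

int main(void) {
    printf("%lld\n", gpt2_num_params(50257, 1024, 12, 768));  // prints 124439808
    return 0;
}
```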
## Tokens
The model's inputs are tokens, not words. A token is a numerical value that represents a sequence of characters, so a sequence of tokens can represent a word or a sentence. There is also a special token, "<|endoftext|>", which indicates that the model has finished producing tokens. For our model's training, these tokens are provided by two datasets: _tinyshakespeare_ and _tinystories_. For GPT-2, the number of distinct tokens is 50257; this value reflects the model's capacity to distinguish between different sequences of characters.

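As a small illustration, a piece of text becomes a sequence of integer ids. The ids in the sketch below are made up for the example (only the end-of-text id, 50256 in the GPT-2 vocabulary, is a real value); an actual sequence would be produced by the GPT-2 BPE tokenizer.

```c
#include <stdio.h>

#define EOT_TOKEN 50256  // id of the "<|endoftext|>" token in the GPT-2 vocabulary

int main(void) {
    // Hypothetical token ids standing for a short sentence, terminated by the
    // end-of-text token. The exact ids depend on the tokenizer and are illustrative only.
    int tokens[] = {464, 2068, 7586, 21831, 13, EOT_TOKEN};
    int n = sizeof(tokens) / sizeof(tokens[0]);

    for (int i = 0; i < n; i++) {
        if (tokens[i] == EOT_TOKEN) {
            printf("<|endoftext|> -> generation stops here\n");
            break;
        }
        printf("token id %d\n", tokens[i]);
    }
    return 0;
}
```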
# Code description
This part is dedicated to facilitating the comprehension of the code given in the reference [GitHub repository](https://github.com/karpathy/llm.c).
## Dictionary
### Global values
- **V** : Vocabulary size (50 257)
- **Vp** : Padded vocabulary size (50 304), padded for data alignment
- **C** : Number of channels (768)
- **T** : Current token sequence length (64)
- **maxT** : Maximum token sequence length (1024)

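These global values are naturally grouped into a small configuration structure. The sketch below is an assumption about such a layout (field names are chosen for readability; `llm.c` keeps similar fields in its `GPT2Config` struct, but the exact definition may differ):

```c
// Sketch of a configuration struct holding the global values listed above.
// T is not included: it is the sequence length of the current batch, chosen at
// runtime and bounded by max_seq_len.
typedef struct {
    int max_seq_len;        // maxT = 1024
    int vocab_size;         // V    = 50257
    int padded_vocab_size;  // Vp   = 50304 (next multiple of 128, for data alignment)
    int num_layers;         // L    = 12 for the "gpt2" size
    int channels;           // C    = 768 for the "gpt2" size
} GPT2ConfigSketch;
```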
### Model parameters
- **wte** : Weights for token embedding, shape (V, C)
- **wpe** : Weights for positional embedding, shape (maxT, C)
- **ln1w** : Weights for the first normalization layer in each transformer layer, shape (L, C)
- **ln1b** : Bias for the first normalization layer in each transformer layer, shape (L, C)
- **lnfb** : Bias for the last normalization layer, shape (C)
- **wte** : Weights for the last linear layer, shape (V, C) (the same matrix as the token embedding)

**_NB: for all the parameters whose shape's first dimension is L, mind that this additional dimension of size L exists because these parameters are used inside the transformer layers, which are repeated L times. This dimension only matters for storage; in the next parts, when talking about these parameters, I will not consider it. This is also why the position of each parameter matrix is recalculated for every transformer layer._**

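Below is a minimal sketch of the pointer arithmetic this note refers to, assuming the per-layer parameters are stored contiguously in one buffer with a leading dimension of L (names are illustrative, not copied from `train_gpt2.c`):

```c
#include <stdlib.h>

// Per-layer parameters stored with a leading dimension of L: the slice for
// layer l starts at offset l * (size of that parameter for one layer).
typedef struct {
    float *ln1w;  // shape (L, C)
    float *ln1b;  // shape (L, C)
} LayerParamsSketch;

int main(void) {
    const int L = 12, C = 768;
    LayerParamsSketch params;
    params.ln1w = calloc((size_t)L * C, sizeof(float));
    params.ln1b = calloc((size_t)L * C, sizeof(float));

    for (int l = 0; l < L; l++) {
        // "Recalculate the position" of layer l's parameters inside the big buffers.
        float *l_ln1w = params.ln1w + (size_t)l * C;
        float *l_ln1b = params.ln1b + (size_t)l * C;
        (void)l_ln1w; (void)l_ln1b;  // a real forward pass would use these slices here
        // the same pattern applies to every parameter whose first dimension is L
    }

    free(params.ln1w);
    free(params.ln1b);
    return 0;
}
```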
### Variables for backward propagation
## Sequential
Concerning the sequential version, each iteration consists of a model forward pass, a model backward pass, and a weights update. The sequential version has been run 4 times, and the results displayed below are the mean over these 4 runs.

![seq2](uploads/2aac9034c025ab01d24e32f162b6543c/seq2.png)
## OpenMP/n-OS-V
For the OpenMP/nOS-V version, the model has been run 40 times, with 40 iterations each, using 112 CPUs. The average runtime per iteration is 1616 ms, and the total runtime is 66 seconds.

Speedup = 24.1 / 1.10 = 21.8\
Efficiency = 21.8 / 112 = 0.19

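For reference, these figures follow the usual definitions: speedup is the sequential runtime divided by the parallel runtime, and efficiency is the speedup divided by the number of CPUs. A minimal check in C (the printed values match the figures above up to rounding of the inputs):

```c
#include <stdio.h>

int main(void) {
    // Figures reported above for the OpenMP/nOS-V run (already rounded).
    double t_sequential = 24.1;  // sequential reference runtime
    double t_parallel   = 1.10;  // OpenMP/nOS-V runtime (same time unit)
    int    num_cpus     = 112;

    double speedup    = t_sequential / t_parallel;  // sequential time / parallel time
    double efficiency = speedup / num_cpus;         // speedup / number of CPUs

    printf("speedup = %.1f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```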
What is unexpected is that the OpenMP/nOS-V version is slower than the OpenMP version.
# Next steps
Here are the next steps I plan to work on:

* Analyze the Paraver trace that I got for the OpenMP version
* Configure OVNI for the OpenMP/n-OS-V version to get a trace