This wiki's goal is to link the GPT-2 model with its implementation in C provided by [this GitHub repository](https://github.com/karpathy/llm.c). It will also describe the different parallelization strategies that were tried, and how these strategies impact the performance of the model.

# Description of the model
The Generative Pre-trained Transformer 2 (GPT-2) is a Large Language Model (LLM).



However, the model implemented in [the GitHub repository](https://github.com/karpathy/llm.c) does not strictly implement this pipeline, but a slightly different version. Here are the differences:

+ There are no dropout layers

+ The second residual forward is connected to the output of the multi-head masked attention, and not to the output of the normalization layer (illustrated by the sketch below)
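
To make the second difference more concrete, here is a minimal C sketch of the two residual additions inside one transformer layer. It is an illustration, not a copy of the llm.c code; the names `layer_input`, `attention_output` and `mlp_output` are made up for this example, and buffers are treated as flat float arrays of size `n`.

```c
#include <stddef.h>

// Generic residual addition: out = skip + branch (element-wise).
static void residual_add(float* out, const float* skip, const float* branch, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = skip[i] + branch[i];
    }
}

// Inside one transformer layer:
//   residual1 = layer_input + attention_output   // first residual
//   residual2 = residual1   + mlp_output         // second residual
// The skip input of the second addition already contains the attention output;
// it is not the output of the normalization layer that feeds the MLP.
void layer_residuals(float* residual1, float* residual2,
                     const float* layer_input,
                     const float* attention_output,
                     const float* mlp_output,
                     size_t n) {
    residual_add(residual1, layer_input, attention_output, n);
    residual_add(residual2, residual1, mlp_output, n);
}
```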
Therefore, here is a rectified sketch of the model implemented:

*(sketch of the rectified GPT-2 model)*
This part is dedicated to facilitating the comprehension of the code given in the reference [GitHub repository](https://github.com/karpathy/llm.c).

## Dictionary

### Global values

- **V** : Vocabulary size (50,257)

- **Vp** : Padded vocabulary size (50,304), padded for data alignment

- **C** : Number of channels (768)

- **T** : Current token sequence length (64)

- **maxT** : Maximum token sequence length (1024)
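
To make these values easier to relate to the code, here is a hedged sketch of how they might be grouped in a C struct, together with a padding rule that turns V into Vp. The struct name `ModelGlobals` and the round-up-to-a-multiple-of-128 rule are assumptions for illustration; only the numbers come from the list above.

```c
#include <stdio.h>

// Illustrative grouping of the global values listed above (field names are
// assumptions, not necessarily the ones used in llm.c).
typedef struct {
    int V;     // vocabulary size (50,257)
    int Vp;    // padded vocabulary size (50,304), padded for data alignment
    int C;     // number of channels (768)
    int T;     // current token sequence length (64)
    int maxT;  // maximum token sequence length (1024)
} ModelGlobals;

// Round the vocabulary size up to a multiple of `align`; with align = 128
// this reproduces 50257 -> 50304.
static int pad_vocab(int V, int align) {
    return ((V + align - 1) / align) * align;
}

int main(void) {
    ModelGlobals g = { .V = 50257, .C = 768, .T = 64, .maxT = 1024 };
    g.Vp = pad_vocab(g.V, 128);
    printf("V = %d, Vp = %d\n", g.V, g.Vp);  // prints V = 50257, Vp = 50304
    return 0;
}
```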
### Model parameters
- **wte** : Weights for token embedding, shape (V, C)

- **wpe** : Weights for positional embedding, shape (maxT, C)

- **ln1w** : Weights for the first normalization layer in a transformer layer, shape (L, C)

- **ln1b** : Bias for the first normalization layer in a transformer layer, shape (L, C)

- ...

- **lnfb** : Bias for the last normalization layer, shape (C)

- **wte** : Weights for the last linear layer, shape (V, C)
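
As an illustration of how wte and wpe are consumed, here is a hedged sketch of the embedding step: each output position gets the wte row of its token plus the wpe row of its position. The function name `encoder_sketch` and the exact loop structure are assumptions, not a quotation of the repository's code.

```c
// Hedged sketch of the embedding step using wte (V, C) and wpe (maxT, C):
// out[b, t, :] = wte[token_id, :] + wpe[t, :]
// Buffers are flattened row-major, as in the C implementation.
void encoder_sketch(float* out,            // (B, T, C)
                    const int* tokens,     // (B, T) token ids in [0, V)
                    const float* wte,      // (V, C) token embedding weights
                    const float* wpe,      // (maxT, C) positional embedding weights
                    int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;
            const float* wte_row = wte + tokens[b * T + t] * C;
            const float* wpe_row = wpe + t * C;
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_row[c] + wpe_row[c];
            }
        }
    }
}
```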
**_NB: for all the parameters whose shape's first dimension is L, mind that this additional dimension of size L is there because these parameters are used inside the transformer layers, which are repeated L times. This dimension is therefore only here for storage. In the next parts, when talking about these parameters, I will not consider this dimension, as it is not useful to do so. You can also notice that this is why the position of each matrix of parameters is recalculated for every transformer layer (see the sketch below)._**
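
Here is a hedged sketch of that position recalculation: a parameter stored with a leading L dimension is one contiguous buffer, and layer `l`'s slice starts at offset `l` times the size of one slice. The helper name `layer_slice` is made up for this example; the repository performs the equivalent pointer offsets inline.

```c
#include <stddef.h>

// Hedged sketch: parameters stored with a leading dimension of size L form a
// single contiguous buffer; layer l's slice starts at l * (size of one slice).
// For ln1w and ln1b of shape (L, C), one slice is C floats.
static const float* layer_slice(const float* param_all, int l, size_t slice_size) {
    return param_all + (size_t)l * slice_size;
}

// Usage inside the loop over the L transformer layers:
//   const float* l_ln1w = layer_slice(ln1w, l, C);  // this layer's ln1w, shape (C)
//   const float* l_ln1b = layer_slice(ln1b, l, C);  // this layer's ln1b, shape (C)
// This is the "position recalculation" mentioned in the note above.
```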
### Variables for backward propagation