This wiki's goal is to link the GPT-2 model with its implementation in C provided by [this GitHub repository](https://github.com/karpathy/llm.c). It also describes the different parallelization strategies that were applied and how these strategies impact the performance of the model.

# Description of the model

The Generative Pre-trained Transformer 2 (GPT-2) is a Large Language Model (LLM) developed by OpenAI, built from a stack of Transformer decoder blocks.

![Basic structure of GPT-2](https://www.researchgate.net/publication/373352176/figure/fig1/AS:11431281202501967@1698856108167/GPT-2-model-architecture-The-GPT-2-model-contains-N-Transformer-decoder-blocks-as-shown.ppm)


However, the model implemented in [the GitHub repository](https://github.com/karpathy/llm.c) does not strictly follow this structure, but a slightly different version of it. Here are the differences:


+ There are no dropout layers
+ The second residual connection takes the output of the masked multi-head attention, not the output of the normalization layer
+ The model's last layer is a softmax
+ The attention function does not include the two linear layers shown in the sketch; these two layers are computed with separate matmul calls


Therefore, here is a rectified sketch of the implemented model:

![Adapted structure of GPT-2](uploads/3f9755fa7c4882bafcc5698cc2cfe636/gpt2-cleaned.png)

## Model's size

For GPT-2, the data goes through a variable number _L_ of transformer layers, which implies a large, variable number of parameters. Therefore, _L_ is one of the two parameters that can be tweaked to trade off runtime against output quality. The other parameter is the number of channels _C_, which will be described later.


| Name | Model's size | _L_ | _C_ |
|------|--------------|-----|-----|
| gpt2 | 124 439 808 | 12 | 768 |
| gpt2-medium | 354 823 168 | 24 | 1024 |
| gpt2-large | 774 030 080 | 36 | 1280 |
| gpt2-xl | 1 557 611 200 | 48 | 1600 |


Mind that the values given here differ from the ones you will find in other sources, as the model structure has been slightly changed. For our application, we will use the first model size in the list above (gpt2).

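
To see where the gpt2 value of 124 439 808 comes from, here is a small sketch that recomputes the parameter count from the tensor shapes listed in the Dictionary section below, assuming the last linear layer reuses the token embedding weights wte (as the dictionary suggests by listing wte twice):

```c
#include <stdio.h>

// Recompute the gpt2 parameter count from the tensor shapes listed in the
// Dictionary section (wte is reused for the last linear layer, so it is
// counted only once).
int main(void) {
    long V = 50257, maxT = 1024, L = 12, C = 768;

    long embeddings = V * C + maxT * C;          // wte + wpe
    long per_layer  = 2 * C                      // ln1w, ln1b
                    + 3 * C * C + 3 * C          // qkvw, qkvb
                    + C * C + C                  // attprojw, attprojb
                    + 2 * C                      // ln2w, ln2b
                    + 4 * C * C + 4 * C          // fcw, fcb
                    + 4 * C * C + C;             // fcprojw, fcprojb
    long final_norm = 2 * C;                     // lnfw, lnfb

    long total = embeddings + L * per_layer + final_norm;
    printf("%ld\n", total);                      // prints 124439808
    return 0;
}
```

Applying the same formula with the (_L_, _C_) pairs of the table reproduces the other three sizes.
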
## Tokens

The model's input consists of tokens, not words. A token is an integer value that represents a sequence of characters, and a sequence of tokens can represent a word or a sentence. There is also a special token, "<|endoftext|>", which signals that the model has finished producing tokens. For our model's training, these tokens are provided by two datasets: _tinyshakespeare_ and _tinystories_. For GPT-2, the number of distinct tokens is 50257. This value reflects the model's capacity to distinguish between different sequences of characters.

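
For illustration, a batch handed to the model is nothing more than a 2-D array of integers (the values below are arbitrary, not real GPT-2 tokenizer output):

```c
// A hypothetical batch of B = 2 sequences of T = 8 tokens each.
// Every entry is an integer token id in [0, 50257); the values shown here
// are illustrative placeholders, except 50256, which is the id of the
// <|endoftext|> token in the GPT-2 vocabulary.
int tokens[2][8] = {
    {   15,  284, 1101,  318,  262, 3290,   13, 50256 },
    {   40,  423,  257, 3797,   11,  290,  257,  3290 },
};
```
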
# Code description

This part is dedicated to facilitating the comprehension of the code given in the [GitHub repository](https://github.com/karpathy/llm.c).

## Dictionary

- **V** : Vocabulary size (50 257)
- **Vp** : Padded vocabulary size (50 304)
- **C** : Number of channels (768)
- **T** : Current token sequence length (64)
- **maxT** : Maximum token sequence length (1024)
- **L** : Number of transformer layers (12)
- **B** : Batch size, i.e. number of token sequences per batch (4)
- **NH** : Number of heads in the attention layer (12)

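
In the C code, these hyperparameters travel together in a small configuration struct. The sketch below is only illustrative; the field names may differ from the exact definition used in the repository:

```c
// Sketch of a GPT-2 configuration holding the hyperparameters above
// (illustrative field names; see the repository for the real definition).
typedef struct {
    int max_seq_len;        // maxT, e.g. 1024
    int vocab_size;         // V,    e.g. 50257
    int padded_vocab_size;  // Vp,   e.g. 50304
    int num_layers;         // L,    e.g. 12
    int num_heads;          // NH,   e.g. 12
    int channels;           // C,    e.g. 768
} GPT2Config;
```
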
---

- **wte** : Weights for token embedding (V, C)
- **wpe** : Weights for positional embedding (maxT, C)
- **ln1w** : Weights for the first normalization layer in the transformer layer (L, C)
- **ln1b** : Bias for the first normalization layer in the transformer layer (L, C)
- **qkvw** : Weights for the first linear layer in multi-head attention (L, 3C, C)
- **qkvb** : Bias for the first linear layer in multi-head attention (L, 3C)
- **attprojw** : Weights for the second linear layer in multi-head attention (L, C, C)
- **attprojb** : Bias for the second linear layer in multi-head attention (L, C)
- **ln2w** : Weights for the second normalization layer in the transformer layer (L, C)
- **ln2b** : Bias for the second normalization layer in the transformer layer (L, C)
- **fcw** : Weights for the first linear layer in the transformer layer (L, 4C, C)
- **fcb** : Bias for the first linear layer in the transformer layer (L, 4C)
- **fcprojw** : Weights for the second linear layer in the transformer layer (L, C, 4C)
- **fcprojb** : Bias for the second linear layer in the transformer layer (L, C)
- **lnfw** : Weights for the last normalization layer (C)
- **lnfb** : Bias for the last normalization layer (C)
- **wte** : Weights for the last linear layer, reusing the token embedding weights (V, C)

***NB: for all the parameters whose shape's first dimension is L, mind that this additional dimension of size L exists because these parameters are used inside the transformer layers, which are repeated L times. This dimension is only there for storage; in the next parts, when talking about these parameters, I will not consider it. This is also why, for every transformer layer, the position of each parameter matrix has to be recomputed, as sketched below.***

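
As a hedged sketch of that recalculation (names are illustrative, but the repository's forward pass uses a similar pointer-offset pattern):

```c
#include <stddef.h>

// Sketch: locate the parameters of transformer layer l inside the stored
// arrays. Each array holds the tensors of all L layers back to back, so the
// slice for layer l starts at l times the size of one layer's tensor.
void select_layer_params(float* ln1w, float* qkvw, float* fcw,
                         int l, int C,
                         float** l_ln1w, float** l_qkvw, float** l_fcw) {
    *l_ln1w = ln1w + (size_t)l * C;          // layer slice of (L, C)
    *l_qkvw = qkvw + (size_t)l * 3 * C * C;  // layer slice of (L, 3C, C)
    *l_fcw  = fcw  + (size_t)l * 4 * C * C;  // layer slice of (L, 4C, C)
}
```
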
## Token embedding

First, let's focus on the shape of the data we receive. For each batch, we receive B (4) sequences of T (64) tokens. Therefore, we receive a matrix of shape (B, T) containing tokens. These tokens are integers whose values range between 0 and V (50 257).

The first layer's role is to encode each token into an array, called a channel, that abstracts its meaning and its position in the sentence. The layer encodes each token into a channel of size (C), producing an output matrix of shape (B, T, C).

![Token_embedding_1_](uploads/ffb83174e0e50930f6533b52be569858/Token_embedding_1_.png)

This layer depends on two arrays of parameters:

+ **wte (V, C)** : To each token value, these weights assign a channel of float values. This channel is retrieved at wte[token_value]
+ **wpe (maxT, C)** : To each token position in the sequence (the array of size T), these weights assign a channel. It means that for a token at position t in the sequence, an array of size C is retrieved at wpe[t]

**Conclusion**: For one token, the generated array will be wte[token_value] + wpe[t], as sketched below.

*Reminder: a channel is an array of size C (768) which represents a token.*

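
Here is a minimal sketch of this layer in C, assuming row-major (B, T, C) storage for the output; the repository implements the same logic in its encoder_forward function, though the exact signature may differ:

```c
// Sketch of the token + positional embedding layer.
// out : (B, T, C)  output activations
// inp : (B, T)     token ids
// wte : (V, C)     token embedding table
// wpe : (maxT, C)  positional embedding table
void encoder_forward_sketch(float* out, const int* inp,
                            const float* wte, const float* wpe,
                            int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;            // channel of token (b, t)
            const float* wte_row = wte + inp[b * T + t] * C;  // row selected by the token value
            const float* wpe_row = wpe + t * C;               // row selected by the position
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_row[c] + wpe_row[c];
            }
        }
    }
}
```
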
## Normalization

These layers always receive and return matrices of shape (B, T, C).

First, the layer performs a standard normalization over each channel. Then, it scales and shifts each normalized value using the weights and biases ln1w and ln1b (both of shape (C)): each value of a normalized channel is multiplied by its corresponding weight from ln1w, and its corresponding bias from ln1b is added.

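
As a sketch, here is the computation for a single channel, assuming the standard layer normalization formula with a small epsilon (1e-5) for numerical stability; the full layer simply applies this to each of the B·T channels:

```c
#include <math.h>

// Sketch: normalize one channel x of size C into out, then scale and shift
// with the learned weight and bias (e.g. ln1w and ln1b for one layer).
void layernorm_channel_sketch(float* out, const float* x,
                              const float* weight, const float* bias, int C) {
    const float eps = 1e-5f;

    float mean = 0.0f;
    for (int c = 0; c < C; c++) mean += x[c];
    mean /= C;

    float var = 0.0f;
    for (int c = 0; c < C; c++) {
        float d = x[c] - mean;
        var += d * d;
    }
    var /= C;

    float rstd = 1.0f / sqrtf(var + eps);
    for (int c = 0; c < C; c++) {
        out[c] = ((x[c] - mean) * rstd) * weight[c] + bias[c];
    }
}
```
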
## Matmul

This operation is always used to compute linear layers. The inputs are:

- An input matrix of shape (B, T, kC), where k is a small integer (so the last dimension is C, 3C or 4C depending on the layer)
- A matrix of weights of shape (OC, kC), where OC is the number of output channels
- A vector of biases of shape (OC)

The output will be a matrix of shape (B, T, OC).

Basically, if the inputs are called A, W and b, the output is the result of *A.Wᵀ + b*: each channel of A (a row of size kC) is multiplied by the weight matrix, and the bias is added to every output row.

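
Here is a naive sketch of this operation, with the weight stored as (OC, kC) as in the dictionary above; the repository's version adds optimizations and parallelization on top of this logic:

```c
#include <stddef.h>

// Naive sketch of the matmul layer: out = inp . weight^T + bias
// inp    : (B, T, C)   input activations (C here stands for kC in the text)
// weight : (OC, C)     weights
// bias   : (OC)        bias, may be NULL
// out    : (B, T, OC)  output activations
void matmul_forward_sketch(float* out, const float* inp,
                           const float* weight, const float* bias,
                           int B, int T, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* inp_bt = inp + (b * T + t) * C;
            float* out_bt = out + (b * T + t) * OC;
            for (int o = 0; o < OC; o++) {
                float val = (bias != NULL) ? bias[o] : 0.0f;
                const float* w_row = weight + o * C;
                for (int c = 0; c < C; c++) {
                    val += inp_bt[c] * w_row[c];
                }
                out_bt[o] = val;
            }
        }
    }
}
```

With k = 1 and OC = 3C this computes the qkv projection; with k = 4 and OC = C it computes the projection back to C channels at the end of the MLP.
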