|
|
|
|
|
### Variables for backward propagation
|
|
|
|
|
|
|
|
|
|
In order to calculate the gradients during backpropagation, we have to keep a history of all the outputs of each layer. For layernorm and attention, we also need to keep a history of additional vectors that are computed inside these layers. All of these vectors are stored in the field _acts_ of the GPT2 model's data structure (a sketch of this structure is given after the list below).
|
|
|
|
|
|
- **inputs** : Contains all the tokens for the current batch (B, T)
|
|
|
|
- **encoded** : Output of the positional_encoding layer (B, T, C)
|
|
|
|
- **ln1** : Output of the first layernorm inside the transformer block (L, B, T, C)
|
|
|
|
|
|
- **ln1_mean** : Mean computed inside the first layernorm, kept for the backward pass (L, B, T)
|
|
|
|
- **ln1_rstd** : Reciprocal standard deviation computed inside the first layernorm, kept for the backward pass (L, B, T)
|
|
|
|
- **qkv** : Output of the first linear layer in the attention layer (L, B, T, 3C)
|
|
|
|
- **atty** : Output of the attention function (L, B, T, C)
|
|
|
|
- **preatt** : Pre-softmax attention scores computed inside the attention layer (L, B, NH, T, T)
|
|
|
|
- **att** : Post-softmax attention weights computed inside the attention layer (L, B, NH, T, T)
|
|
|
|
- **attproj** : Output of the second linear layer in the attention layer (L, B, T, C)
|
|
|
|
- **residual2** : Output of the first residual layer in the transformer block (L, B, T, C)
|
|
|
|
- **ln2** : Output of the second normalization layer in the transformer block (L, B, T, C)
|
|
|
|
- **ln2_mean** : Mean computed inside the second layernorm, kept for the backward pass (L, B, T)
|
|
|
|
- **ln2_rstd** : Reciprocal standard deviation computed inside the second layernorm, kept for the backward pass (L, B, T)
|
|
|
|
- **fch** : Output of the first linear layer of the MLP in the transformer block (L, B, T, 4C)
|
|
|
|
- **fch_gelu** : Output of the GELU layer in the transformer block (L, B, T, 4C)
|
|
|
|
- **fcproj** : Output of the last linear layer of the MLP in the transformer block (L, B, T, C)
|
|
|
|
- **residual3** : Output of the second residual layer in the transformer block (L, B, T, C)
|
|
|
|
|
|
|
|
**_NB : As mentioned in the previous part, the dimension L is present because we have L transformer layers. When these tensors are passed to the layer functions, this extra dimension is removed._**
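To make the layout more concrete, here is a minimal sketch of what the _acts_ field could look like in C, as a hypothetical `ActivationTensors` struct. The field names and shapes follow the list above; the exact structure in the repository may differ (for example, it presumably also holds the post-transformer activations such as the logits), so treat this as an illustration rather than the actual definition.

```c
// Illustrative sketch only: field names and shapes follow the list above.
// B = batch size, T = sequence length, C = channels, L = number of layers,
// NH = number of attention heads. The integer tokens ("inputs", shape (B, T))
// would typically live in a separate int buffer.
typedef struct {
    float* encoded;    // (B, T, C)        output of the positional_encoding layer
    float* ln1;        // (L, B, T, C)     output of the first layernorm
    float* ln1_mean;   // (L, B, T)        mean kept for the layernorm backward pass
    float* ln1_rstd;   // (L, B, T)        reciprocal std kept for the layernorm backward pass
    float* qkv;        // (L, B, T, 3C)    query/key/value projection
    float* atty;       // (L, B, T, C)     output of the attention function
    float* preatt;     // (L, B, NH, T, T) pre-softmax attention scores
    float* att;        // (L, B, NH, T, T) post-softmax attention weights
    float* attproj;    // (L, B, T, C)     output of the attention output projection
    float* residual2;  // (L, B, T, C)     first residual connection
    float* ln2;        // (L, B, T, C)     output of the second layernorm
    float* ln2_mean;   // (L, B, T)
    float* ln2_rstd;   // (L, B, T)
    float* fch;        // (L, B, T, 4C)    hidden layer of the MLP
    float* fch_gelu;   // (L, B, T, 4C)    hidden layer after the GELU
    float* fcproj;     // (L, B, T, C)     output of the MLP projection
    float* residual3;  // (L, B, T, C)     second residual connection
} ActivationTensors;
```

With this kind of flat layout, removing the L dimension before calling a layer function is just a pointer offset: for example, `acts.ln1 + l * B * T * C` gives the (B, T, C) slice of _ln1_ belonging to layer `l`.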
|
|
|
|
|
|
|
|
![GPT2-inputs](uploads/4f72849b1637d91dcbe216cc24e417d1/GPT2-inputs.png)
|
|
|
|
|
|
## Functions description
|
|
|
|
|
|
|
Here is the order in which the functions are called in the model.
|
|
|
|
|
![GPT2-functions](uploads/e313238307e971f0bebfad95be4c507e/GPT2-functions.png)
|
|
|
|
|
|
|
|
|
|
|
|
### Token embedding
|
|
|
|
|
|
|
|
|
|
First, let's focus on the shape of the data we receive. For each batch, we receive B (4) sequences of T (64) tokens. Therefore, we receive a matrix of shape (B, T) containing tokens. These tokens are integers whose values range from 0 to V-1, where V is the vocabulary size (50,257).
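As an illustration of what this layer has to do, here is a minimal sketch of a token + positional embedding step in C, assuming row-major buffers. The function name `embed_tokens` and the table names `wte` (token embeddings, shape (V, C)) and `wpe` (positional embeddings, shape (maxT, C)) are illustrative; the actual function in the repository may be organised differently.

```c
#include <stddef.h>

// Minimal sketch of a token + positional embedding step (illustrative only).
// inputs: (B, T)    integer token ids in [0, V)
// wte:    (V, C)    token embedding table      (name assumed for illustration)
// wpe:    (maxT, C) positional embedding table (name assumed for illustration)
// out:    (B, T, C) the "encoded" activations
void embed_tokens(float* out, const int* inputs,
                  const float* wte, const float* wpe,
                  int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const int token = inputs[b * T + t];
            float* out_bt = out + ((size_t)b * T + t) * C;
            const float* wte_row = wte + (size_t)token * C;  // embedding of this token
            const float* wpe_row = wpe + (size_t)t * C;      // embedding of position t
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_row[c] + wpe_row[c];
            }
        }
    }
}
```

Each of the B*T tokens is thus turned into a vector of C floats, which gives exactly the (B, T, C) shape of the _encoded_ activations listed above.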
|