|
|
|
|
|
### Variables for backward propagation
|
|
|
|
|
|
|
|
|
In order to calculate the gradients during backpropagation, we have to keep a history of all the outputs of each layer. For layernorm and attention, we also need to keep a history of additional vectors that are computed inside these layers. All of these tensors are stored in the _acts_ field of the GPT2 model's data structure (a sketch of this structure is given after the list below).
|
|
|
|
|
|
- **inputs** : Contains all the tokens for the current batch (B, T)
|
|
|
- **encoded** : Output of the positional_encoding layer (B, T, C)
|
|
|
- **ln1** : Output of the first layernorm inside the transformer block (L, B, T, C)
|
|
|
- **ln1_mean** : Mean computed inside the first layernorm (L, B, T)
|
|
|
- **ln1_rstd** : Reciprocal standard deviation computed inside the first layernorm (L, B, T)
|
|
|
- **qkv** : Output of the first linear layer in the attention layer, i.e. the query/key/value projections (L, B, T, 3C)
|
|
|
- **atty** : Output of the attention function (L, B, T, C)
|
|
|
- **preatt** : Pre-softmax attention scores (L, B, NH, T, T)
|
|
|
- **att** : Post-softmax attention weights (L, B, NH, T, T)
|
|
|
- **attproj** : Output of the second linear layer in the attention layer (L, B, T, C)
|
|
|
- **residual2** : Output of the first residual layer in the transformer block (L, B, T, C)
|
|
|
- **ln2** : Output of the second normalization layer in the transformer block (L, B, T, C)
|
|
|
- **ln2_mean** : Mean computed inside the second layernorm (L, B, T)
|
|
|
- **ln2_rstd** : Reciprocal standard deviation computed inside the second layernorm (L, B, T)
|
|
|
- **fch** : Output of the first linear layer of the MLP in the transformer block (L, B, T, 4C)
|
|
|
- **fch_gelu** : Output of the GELU layer in the transformer block (L, B, T, 4C)
|
|
|
- **fcproj** : Output of the last linear layer of the MLP in the transformer block (L, B, T, C)
|
|
|
- **residual3** : Output of the second residual layer in the transformer block (L, B, T, C)
|
|
|
|
|
|
**_NB : As mentioned in the previous part, the dimension L is here because we have L transformer layers. When passing these activations to our layer functions, this extra dimension is removed: each layer only receives its own slice._**
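To make this layout more concrete, here is a minimal C sketch of what such an activation structure could look like. The field names and shapes follow the list above; the exact types and names used in the repository's _acts_ structure may differ.

```c
// Hypothetical sketch of the activation buffers listed above.
// B = batch size, T = sequence length, C = channels,
// L = number of transformer layers, NH = number of attention heads.
typedef struct {
    int*   inputs;     // (B, T)            tokens of the current batch
    float* encoded;    // (B, T, C)         output of the positional encoding
    float* ln1;        // (L, B, T, C)      first layernorm output
    float* ln1_mean;   // (L, B, T)         mean of the first layernorm
    float* ln1_rstd;   // (L, B, T)         reciprocal std of the first layernorm
    float* qkv;        // (L, B, T, 3C)     query/key/value projections
    float* atty;       // (L, B, T, C)      attention output
    float* preatt;     // (L, B, NH, T, T)  pre-softmax attention scores
    float* att;        // (L, B, NH, T, T)  post-softmax attention weights
    float* attproj;    // (L, B, T, C)      attention output projection
    float* residual2;  // (L, B, T, C)      first residual connection
    float* ln2;        // (L, B, T, C)      second layernorm output
    float* ln2_mean;   // (L, B, T)         mean of the second layernorm
    float* ln2_rstd;   // (L, B, T)         reciprocal std of the second layernorm
    float* fch;        // (L, B, T, 4C)     first MLP linear layer
    float* fch_gelu;   // (L, B, T, 4C)     GELU output
    float* fcproj;     // (L, B, T, C)      MLP projection back to C
    float* residual3;  // (L, B, T, C)      second residual connection
} ActivationTensors;
```

During the forward and backward passes, the slice belonging to layer `l` is then typically selected with a simple pointer offset, e.g. `acts.ln1 + l * B * T * C`, which is what removes the leading L dimension mentioned in the note above.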
|
|
|
|
|
|
![GPT2-inputs](uploads/4f72849b1637d91dcbe216cc24e417d1/GPT2-inputs.png)
|
|
|
|
|
|
## Functions description
|
|
|
|
Here is the order in which the functions are called in the model.
|
|
|
|
|
![GPT2-functions](uploads/e313238307e971f0bebfad95be4c507e/GPT2-functions.png)
|
|
|
|
|
|
|
|
|
### Token embedding
|
|
|
|
|
|
First, let's focus on the shape of the data we receive. For each batch, we receive B (4) sequences of T (64) tokens. Therefore, we receive a matrix of shape (B, T) containing tokens. These tokens are integers whose values range between 0 and V (50,257).
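As a rough illustration of this step (not the repository's exact code), here is a minimal C sketch of the embedding lookup, assuming, as in GPT-2, a learned token embedding table `wte` of shape (V, C) and a learned positional table `wpe`. The function name and signature are hypothetical.

```c
// Minimal sketch of the embedding step: each token id selects a row of the
// token embedding table, and the row matching its position is added to it.
void encoder_forward(float* out,            // (B, T, C) embedded tokens
                     const int* inputs,     // (B, T) token ids in [0, V)
                     const float* wte,      // (V, C) token embedding table
                     const float* wpe,      // (maxT, C) positional embedding table
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;
            const float* wte_ix = wte + inputs[b * T + t] * C; // row of this token
            const float* wpe_t  = wpe + t * C;                 // row of this position
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_ix[c] + wpe_t[c];              // token + position embedding
            }
        }
    }
}
```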
|
The last step is to calculate the dot product between our multi-head attention matrix and the value vectors.
|
|
|
|
|
![Multi-headAttention11](uploads/cab96fcaece49c827f700814ce7300af/Multi-headAttention11.png)
|
|
|
|
|
|
To do this, we still need to process each head separately. We compute the product between one attention weight of the multi-head attention matrix and the matching value sub-vector of the corresponding head. Since we are multiplying an array of size _1_ (the attention weight for one head) by an array of size _hs_ (the value sub-array for the same head), the output array is of size _hs_. Doing this for all _NH_ heads, the intermediate matrix is of size (NH\*hs, T, T) = (C, T, T). Finally, we sum along the rows (i.e. over the attended positions), which gives an output matrix of shape (C, T).
|
|
|
|
|
|
|
|
|
Also, remember that we are processing batches, so this applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we drop the representation we used for convenience, the actual output shape is (B, T, C) (we have simply swapped two axes).
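To summarize this step in code, here is a hedged C sketch of the "attention weights times values" computation for a single batch element; the memory layout and function name are assumptions made for illustration, not necessarily the repository's implementation.

```c
// Sketch of the attention-weights-times-values step described above, for one
// batch element.
//   att : (NH, T, T)  attention weights after softmax
//   v   : (T, C)      value vectors; each row is split into NH chunks of size hs
//   out : (T, C)      weighted sum of values, with C = NH * hs
void attention_values(float* out, const float* att, const float* v,
                      int T, int NH, int hs) {
    int C = NH * hs;
    for (int t = 0; t < T; t++) {
        for (int h = 0; h < NH; h++) {
            float* out_th = out + t * C + h * hs;
            for (int c = 0; c < hs; c++) out_th[c] = 0.0f;
            // Sum over the attended positions t2 (the row summation described
            // above); the causal mask means only positions t2 <= t contribute.
            for (int t2 = 0; t2 <= t; t2++) {
                float a = att[(h * T + t) * T + t2];      // scalar weight for head h
                const float* v_t2 = v + t2 * C + h * hs;  // value sub-vector of size hs
                for (int c = 0; c < hs; c++) {
                    out_th[c] += a * v_t2[c];             // scalar x vector, accumulated
                }
            }
        }
    }
}
```

The innermost accumulation over `t2` is the per-row summation described above, and concatenating the NH chunks of size hs gives the final C channels for each token.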
|
|
|
|