## Dictionary
### Global values
- **V** : Vocabulary size (50 257)
- **Vp** : Padded vocabulary size (50 304)
- **C** : Number of channels (768)
- **B** : Batch size i.e. number of token sequences per batch (4)
- **NH** : Number of heads in the attention layer (12)

---
### Model parameters
- **wte** : Weights for token embedding (V, C)
- **wpe** : Weights for positional embedding (maxT, C)

**_NB : for all parameters whose shape's first dimension is L, keep in mind that this extra dimension of size L exists because these parameters are used in the transformer layers, which are repeated L times. This dimension is therefore only here for storage. In the next parts, when talking about these parameters, I will not consider this dimension, as it is not useful to do so. You can also notice that this is why, for every transformer layer, the position of each parameter matrix is recalculated._**
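
As a small illustration of this storage scheme (a minimal sketch, not the actual code of the repository; the struct and function names here are hypothetical), a pointer to the parameters of layer _l_ can be recomputed by offsetting into the flat buffers:

```c
#include <stddef.h>

/* Hypothetical container: parameters whose shape starts with L are stored
 * contiguously, one slice per transformer layer. */
typedef struct {
    float *ln1w;  /* (L, C)       layernorm 1 weights for all layers   */
    float *qkvw;  /* (L, 3*C, C)  attention QKV weights for all layers */
    int L;        /* number of transformer layers                      */
    int C;        /* number of channels                                */
} LayerParams;

/* Layernorm-1 weights of layer l: skip l slices of size C. */
static float *ln1w_at(const LayerParams *p, int l) {
    return p->ln1w + (size_t)l * p->C;
}

/* QKV weights of layer l: skip l slices of size 3*C*C. */
static float *qkvw_at(const LayerParams *p, int l) {
    return p->qkvw + (size_t)l * 3 * p->C * p->C;
}
```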
### Variables for backward propagation
In order to calculate the gradients for backpropagation, we have to keep a history of all the outputs of each layer. For layernorm and attention, we also need to keep a history of additional vectors that are computed inside these layers.

- **inputs** : Contains all the tokens for the current batch (B, T)
- **encoded** : Output of the positional_encoding layer (B, T, C)
- **ln1** : Output of the first layernorm inside the transformer block (L, B, T, C)
- **ln1_mean** : Mean computed inside the first layernorm, kept for the backward pass (L, B, T)
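
To give an idea of how this history can be kept (a minimal sketch assuming only the buffers and shapes listed above; the struct and field names are hypothetical, not the repository's actual definitions):

```c
/* Hypothetical activation "history" kept during the forward pass so that
 * the backward pass can reuse it. Buffers are flat arrays; the comments
 * give their conceptual shapes. */
typedef struct {
    int   *inputs;    /* (B, T)        tokens of the current batch      */
    float *encoded;   /* (B, T, C)     output of the embedding step     */
    float *ln1;       /* (L, B, T, C)  output of the first layernorm    */
    float *ln1_mean;  /* (L, B, T)     per-token mean kept for backward */
} Activations;
```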
## Function descriptions
Here is the order in which the functions are called in the model. You will find their descriptions in the chapters below:

1. **function_name** _input_size -> output_size_ // _parameter1 parameter2..._
2. **encoder_forward** _(B,T) -> (B,T,C)_ // _wte wpe_
5. **matmul_forward** _(B,T,C) -> (B,T,Vp)_ // _wte_
6. **softmax_forward** _(B,T,Vp) -> (B,T,Vp)_

![GPT2-functions](uploads/e313238307e971f0bebfad95be4c507e/GPT2-functions.png)
### Token embedding
First, let's focus on the shape of the data we receive. For each batch, we receive B (4) sequences of T (64) tokens. Therefore, we receive a matrix of shape (B, T) containing tokens. These tokens are integers whose values range from 0 to V - 1 (V = 50 257).
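
As a minimal sketch of the lookup this section describes (illustrative only; the function name is hypothetical, and the positional embedding added in the same encoder step is left out here), each token id simply selects its row of **wte**:

```c
/* Token embedding lookup: out[b][t][:] = wte[inp[b][t]][:].
 * out: (B, T, C), inp: (B, T) token ids in [0, V), wte: (V, C). */
void token_embed(float *out, const int *inp, const float *wte,
                 int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float *row = wte + inp[b * T + t] * C; /* row of this token */
            float *out_bt = out + (b * T + t) * C;
            for (int c = 0; c < C; c++) {
                out_bt[c] = row[c];
            }
        }
    }
}
```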
First, we need to introduce some vocabulary:
* **NH :** the Number of Heads; NH must be a divisor of C
* **hs :** the head size, equal to C / NH

The principle of multi-head attention is to divide each Q, K and V into _heads_. As you can see above, K, Q and V are divided into NH parts, thus making sub-arrays of size _hs_. Then, we apply the attention mechanism we have seen earlier to each head separately. The NH heads never interact with one another during the attention computation (which makes parallel computation possible). Now, as we are computing a single attention value for each head, the shape of the multi-head attention matrix will be (NH, T, T).
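
As a minimal sketch of this idea for a single sequence (illustrative only; the function name is hypothetical, the usual 1/sqrt(hs) scaling is assumed, and the causal mask and softmax that follow are omitted to keep it short), each head computes its own (T, T) block of scores from its _hs_-sized slices of Q and K:

```c
#include <math.h>

/* Multi-head attention scores for ONE sequence.
 * q, k: (T, C), where each row splits into NH heads of size hs = C / NH.
 * att:  (NH, T, T) scaled dot products (before masking and softmax). */
void attention_scores(float *att, const float *q, const float *k,
                      int T, int C, int NH) {
    int hs = C / NH;                        /* head size */
    float scale = 1.0f / sqrtf((float)hs);
    for (int h = 0; h < NH; h++) {
        for (int t1 = 0; t1 < T; t1++) {       /* query position */
            for (int t2 = 0; t2 < T; t2++) {   /* key position   */
                const float *qh = q + t1 * C + h * hs; /* head h of query t1 */
                const float *kh = k + t2 * C + h * hs; /* head h of key t2   */
                float dot = 0.0f;
                for (int i = 0; i < hs; i++) dot += qh[i] * kh[i];
                att[(h * T + t1) * T + t2] = dot * scale;
            }
        }
    }
}
```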
### From attention matrix to the output of the attention layer
The last step is to calculate the dot product between our multi-head attention matrix and the value vectors.

![Multi-headAttention11](uploads/cab96fcaece49c827f700814ce7300af/Multi-headAttention11.png)

To do this, we still need to take each head separately. We then compute the dot product between one head of the multi-head attention matrix and the matching head of the value vector. Furthermore, as we are calculating the dot product between an array of size _1_ (the attention value for one head) and an array of size _hs_ (the value sub-array for one head), the output array will be of size _hs_. Also, as we are doing this for _NH_ heads, the global matrix at the end will be of size (NH\*hs, T, T) = (C, T, T). Moreover, we need to sum along each row (i.e. over the T positions of the sequence), making an output matrix of shape (C, T).
Also, remember that we are processing batches, so this process applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we drop the representation we used for convenience, the actual output shape is (B, T, C) (we have just swapped two axes).
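
As a minimal sketch of this last step for a single sequence (illustrative only; the function name is hypothetical and the attention weights are assumed to be already softmax-normalised), each head scales its value sub-arrays by its attention values and sums them over the T positions:

```c
/* Attention output for ONE sequence.
 * att: (NH, T, T) attention weights, v: (T, C), out: (T, C).
 * For each output token t1 and head h:
 *   out[t1][h*hs : (h+1)*hs] = sum over t2 of att[h][t1][t2] * v[t2][h*hs : (h+1)*hs]. */
void attention_output(float *out, const float *att, const float *v,
                      int T, int C, int NH) {
    int hs = C / NH; /* head size */
    for (int t1 = 0; t1 < T; t1++) {
        for (int h = 0; h < NH; h++) {
            float *out_h = out + t1 * C + h * hs;  /* head h of output token t1 */
            for (int i = 0; i < hs; i++) out_h[i] = 0.0f;
            for (int t2 = 0; t2 < T; t2++) {       /* sum over the T positions  */
                float a = att[(h * T + t1) * T + t2];
                const float *vh = v + t2 * C + h * hs;
                for (int i = 0; i < hs; i++) out_h[i] += a * vh[i];
            }
        }
    }
}
```

With batches, the same loops simply run once per sequence, giving the (B, T, C) output mentioned above.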
### Residual layer