- **lnfb** : Bias for the last normalization layer (C)

- **wte** : Weights for the last linear layer (V, C)


**_NB : For all the parameters whose shape's first dimension is L, keep in mind that this additional dimension of size L exists because these parameters are used inside the transformer layers, which are repeated L times. This dimension is therefore only here for storage. In the next parts, when talking about these parameters, I will not consider this dimension, as it is not useful to do so. You can also notice that this is why the position of each matrix of parameters is recalculated for every transformer layer._**

**Concerning the matrices having a leading dimension of size L:** These matrices contain parameters/inputs/outputs for layers that all live inside transformer blocks. Let's take the first normalization layer inside a transformer block. For this layer, the parameters are _ln1w_ and _ln1b_, and they are stored in the model as matrices of shape (L, C). However, the matrices of parameters that we feed to _layernorm_forward_ have a shape of (C). In fact, the dimension L is only here because we have L transformer blocks, meaning that we call _layernorm_forward_ L times with different parameters that are all stored inside _ln1w_ and _ln1b_. Be aware, then, that the shape of the stored parameter matrices is different from the shape of the parameter matrices passed to the function inside the transformer block. Therefore, when I mention the parameter matrix _ln1w_, I can be referring to a matrix of shape (L, C) or (C), depending on the context and what I am explaining.

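
To make this more concrete, here is a minimal, self-contained sketch of the idea (this is not the actual llm.c code: the real _layernorm_forward_ also stores the per-position mean and rstd for the backward pass, and the buffers below are toy stand-ins). It shows how a weight matrix stored as (L, C) is reduced to a (C) slice for a given layer before being passed to the layer function:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Simplified stand-in for layernorm_forward: normalizes inp of shape (B, T, C)
// using one weight/bias vector of shape (C).
void layernorm_forward(float* out, float* inp, float* weight, float* bias,
                       int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* x = inp + b * T * C + t * C;
            float* o = out + b * T * C + t * C;
            float mean = 0.0f;
            for (int c = 0; c < C; c++) { mean += x[c]; }
            mean /= C;
            float var = 0.0f;
            for (int c = 0; c < C; c++) { float d = x[c] - mean; var += d * d; }
            float rstd = 1.0f / sqrtf(var / C + 1e-5f);
            for (int c = 0; c < C; c++) { o[c] = (x[c] - mean) * rstd * weight[c] + bias[c]; }
        }
    }
}

int main(void) {
    int L = 2, B = 1, T = 3, C = 4;  // toy sizes
    // ln1w and ln1b are stored for all layers at once, as (L, C) matrices.
    float* ln1w = malloc((size_t)L * C * sizeof(float));
    float* ln1b = malloc((size_t)L * C * sizeof(float));
    float* inp  = malloc((size_t)B * T * C * sizeof(float));
    float* out  = malloc((size_t)B * T * C * sizeof(float));
    for (int i = 0; i < L * C; i++)     { ln1w[i] = 1.0f; ln1b[i] = 0.0f; }
    for (int i = 0; i < B * T * C; i++) { inp[i] = (float)i; }

    for (int l = 0; l < L; l++) {
        // The parameters of layer l are a (C) slice of the (L, C) storage,
        // starting at offset l * C: this is the "position recalculation".
        float* l_ln1w = ln1w + l * C;
        float* l_ln1b = ln1b + l * C;
        layernorm_forward(out, inp, l_ln1w, l_ln1b, B, T, C);
        printf("layer %d normalized, out[0] = %f\n", l, out[0]);
    }

    free(ln1w); free(ln1b); free(inp); free(out);
    return 0;
}
```
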
### Variables for backward propagation

- **fcproj** : Output of the last linear layer in the transformer block (L, B, T, C)


- **residual3** : Output of the second residual layer in the transformer block (L, B, T, C)


**_NB : Just like what was mentioned in the previous part, the dimension L is here because we have L transformer layers. When passing all these variables to our layer functions, this extra dimension is removed, since we are then inside a single layer._**

**_NB : The remark made in the previous section about matrices with a leading dimension of size L applies here too._**

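
As an illustration, here is a small, self-contained sketch (toy sizes and a hypothetical buffer, not the actual llm.c code) of how an activation stored as (L, B, T, C), such as _residual3_, becomes a (B, T, C) slice once we are inside a given layer:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int L = 3, B = 2, T = 4, C = 5;  // toy sizes

    // An activation such as residual3 is stored as one flat (L, B, T, C) buffer.
    float* residual3 = calloc((size_t)L * B * T * C, sizeof(float));

    for (int l = 0; l < L; l++) {
        // Inside layer l we only ever see a (B, T, C) slice of that buffer.
        float* l_residual3 = residual3 + (size_t)l * B * T * C;
        // Element (b, t, c) of this slice lives at l_residual3[b * T * C + t * C + c].
        l_residual3[0] = (float)l;  // touch the first element of the slice
    }

    // The first element of layer 2's slice is the value written above (2.0).
    printf("residual3[2, 0, 0, 0] = %f\n", residual3[2 * B * T * C]);
    free(residual3);
    return 0;
}
```
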
![GPT2-inputs](uploads/4f72849b1637d91dcbe216cc24e417d1/GPT2-inputs.png)


The last step is to calculate the dot product between our multi-head attention matrix and the value vectors. To do this, we still need to take each head separately: we calculate the dot product between one head of the multi-head attention matrix and the matching value sub-vector of that head. As we are calculating the dot product between an array of size _1_ (the attention value for one head) and an array of size _hs_ (the value sub-array for one head), the output array is of size _hs_. Since we do this for _NH_ heads and for every pair of positions, the global matrix at the end is of size (NH\*hs, T, T) = (C, T, T). Finally, we sum along one of the two T dimensions (accumulating the weighted value vectors of all attended tokens), which gives an output matrix of shape (C, T).

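
Below is a simplified, self-contained sketch of this per-head weighted sum (toy sizes and hypothetical buffer names; the real attention code in llm.c also computes the causal softmax that produces the attention weights). For each head, the scalar attention value for a pair of positions scales the (hs)-sized value slice of that head, and the results are accumulated over all attended tokens:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int B = 1, T = 3, NH = 2, hs = 4;  // toy sizes
    int C = NH * hs;

    // att:   attention weights, shape (B, NH, T, T), assumed already softmax-normalized
    // value: value vectors,     shape (B, T, C), columns [h*hs, (h+1)*hs) belong to head h
    // out:   attention output,  shape (B, T, C)
    float* att   = calloc((size_t)B * NH * T * T, sizeof(float));
    float* value = calloc((size_t)B * T * C, sizeof(float));
    float* out   = calloc((size_t)B * T * C, sizeof(float));

    // Fill the inputs with something non-zero so the example produces visible numbers.
    for (int i = 0; i < B * NH * T * T; i++) { att[i] = 1.0f / T; }
    for (int i = 0; i < B * T * C; i++)      { value[i] = (float)(i % 7); }

    for (int b = 0; b < B; b++) {
        for (int h = 0; h < NH; h++) {
            for (int t = 0; t < T; t++) {
                // (hs)-sized slice of the output for batch b, position t, head h
                float* out_th = out + b * T * C + t * C + h * hs;
                for (int t2 = 0; t2 < T; t2++) {
                    // scalar attention value of head h between positions t and t2
                    float a = att[b * NH * T * T + h * T * T + t * T + t2];
                    // matching (hs)-sized value slice of head h at position t2
                    float* v = value + b * T * C + t2 * C + h * hs;
                    // scalar * vector, accumulated over all attended tokens t2
                    for (int i = 0; i < hs; i++) { out_th[i] += a * v[i]; }
                }
            }
        }
    }

    printf("out[0, 0, 0] = %f\n", out[0]);
    free(att); free(value); free(out);
    return 0;
}
```

Note that this sketch writes the output directly in the (B, T, C) layout discussed just below, rather than the (C, T) representation used in the sketches.
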

Also, remember that we are processing batches, so this process applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, keep in mind that I used the shape (B, C, T) only to draw more understandable sketches. In reality, the model uses matrices of dimension (B, T, C) (we have simply flipped two axes), but the idea and the operations are the same.

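
Since the two layouts only differ by an indexing convention, a tiny (hypothetical) sketch is enough to show the difference in flat offsets:

```c
#include <stdio.h>

int main(void) {
    int B = 2, T = 3, C = 4;   // toy sizes
    int b = 0, t = 1, c = 2;   // one logical element of the attention output

    // The same logical element (b, t, c) lives at different flat offsets
    // depending on the chosen layout:
    int offset_btc = b * T * C + t * C + c;  // (B, T, C), the layout the model uses
    int offset_bct = b * C * T + c * T + t;  // (B, C, T), the layout used in the sketches

    printf("(B, T, C) offset: %d\n", offset_btc);  // 6
    printf("(B, C, T) offset: %d\n", offset_bct);  // 7
    return 0;
}
```
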
### Residual layer