However, the model implemented in the GitHub repository differs from this sketch on a few points:
|
|
+ The model's last layer is a softmax
|
|
|
+ The attention layer's function does not include the two linear layers shown in the sketch; these are computed with separate matmul_forward calls
|
|
|
|
|
|
|
|
|
Therefore, here is a rectified sketch of the implemented model:
|
|
|
|
|
|
## Model's size
|
|
|
|
Here is the order in which the functions are called in the model. You will find each one listed with the notation described in the first item below; a C sketch of this call order is also given after the list.
|
|
1. **function_name** _input_size -> output_size_ // _parameter1 parameter2..._
|
|
|
2. **encoder_forward** _(B,T) -> (B,T,C)_ // _wte wpe_
|
|
|
3. transformer layer, repeated L times, composed of:
|
|
|
1. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln1w ln1b_
|
|
|
2. **matmul_forward** _(B,T,C) -> (B,T,3C)_ // _qkvw qkvb_
|
|
|
3. **attention_forward** _(B,T,3C) -> (B,T,C)_
|
|
|
4. **matmul_forward** _(B,T,C) -> (B,T,C)_ // _attprojw attprojb_
|
|
|
5. **residual_forward** _(B,T,C) + (B,T,C) -> (B,T,C)_
|
|
|
6. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln2w ln2b_
|
|
|
7. **matmul_forward** _(B,T,C) -> (B,T,4C)_ // _fcw fcb_
|
|
|
8. **gelu_forward** _(B,T,4C) -> (B,T,4C)_
|
|
|
9. **matmul_forward** _(B,T,4C) -> (B,T,C)_ // _fcprojw fcprojb_
|
|
|
10. **residual_forward** _(B,T,C) + (B,T,C) -> (B,T,C)_
|
|
|
4. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _lnfw lnfb_
|
|
|
5. **matmul_forward** _(B,T,C) -> (B,T,Vp)_ // _wte_
|
|
|
6. **softmax_forward** _(B,T,Vp) -> (B,T,Vp)_
|
|
|
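The following is a minimal, declaration-only C sketch of this call order. The prototypes are simplified (the real llm.c functions also take a few extra arguments, such as buffers for the layernorm statistics and the attention scores), the activation buffers are assumed to be pre-allocated and are reused in place for brevity, and the per-layer weight offsets are only indicative. Treat it as pseudocode for the ordering rather than the repository's code.

```c
#include <stddef.h>

/* Simplified prototypes: the real llm.c functions take extra arguments. */
void encoder_forward(float* out, int* inp, float* wte, float* wpe, int B, int T, int C);
void layernorm_forward(float* out, float* inp, float* w, float* b, int B, int T, int C);
void matmul_forward(float* out, float* inp, float* w, float* b, int B, int T, int C, int OC);
void attention_forward(float* out, float* inp, int B, int T, int C, int NH);
void residual_forward(float* out, float* inp1, float* inp2, int N);
void gelu_forward(float* out, float* inp, int N);
void softmax_forward(float* probs, float* logits, int B, int T, int V, int Vp);

/* Forward pass in the order listed above. Activation buffers (x, norm, qkv, atty,
 * proj, fch, logits, probs) are assumed pre-allocated; per-layer weights are read
 * at an offset of l times the size of one layer's tensor. Buffers are reused here
 * for brevity, whereas the real code keeps every activation for the backward pass. */
void gpt2_forward_sketch(int* tokens,
                         float* wte, float* wpe, float* ln1w, float* ln1b,
                         float* qkvw, float* qkvb, float* attprojw, float* attprojb,
                         float* ln2w, float* ln2b, float* fcw, float* fcb,
                         float* fcprojw, float* fcprojb, float* lnfw, float* lnfb,
                         float* x, float* norm, float* qkv, float* atty,
                         float* proj, float* fch, float* logits, float* probs,
                         int B, int T, int C, int L, int NH, int V, int Vp) {
    encoder_forward(x, tokens, wte, wpe, B, T, C);                               /* (B,T) -> (B,T,C)  */
    for (int l = 0; l < L; l++) {                                                /* transformer layer, L times */
        layernorm_forward(norm, x, ln1w + l*C, ln1b + l*C, B, T, C);
        matmul_forward(qkv, norm, qkvw + l*3*C*C, qkvb + l*3*C, B, T, C, 3*C);   /* -> (B,T,3C) */
        attention_forward(atty, qkv, B, T, C, NH);                               /* -> (B,T,C)  */
        matmul_forward(proj, atty, attprojw + l*C*C, attprojb + l*C, B, T, C, C);
        residual_forward(x, x, proj, B*T*C);                                     /* skip connection */
        layernorm_forward(norm, x, ln2w + l*C, ln2b + l*C, B, T, C);
        matmul_forward(fch, norm, fcw + l*4*C*C, fcb + l*4*C, B, T, C, 4*C);     /* -> (B,T,4C) */
        gelu_forward(fch, fch, B*T*4*C);
        matmul_forward(proj, fch, fcprojw + l*4*C*C, fcprojb + l*C, B, T, 4*C, C); /* -> (B,T,C) */
        residual_forward(x, x, proj, B*T*C);                                     /* skip connection */
    }
    layernorm_forward(norm, x, lnfw, lnfb, B, T, C);                             /* final layernorm */
    matmul_forward(logits, norm, wte, NULL, B, T, C, Vp);                        /* wte reused (weight tying), no bias */
    softmax_forward(probs, logits, B, T, V, Vp);                                 /* (B,T,Vp) probabilities */
}
```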
|
|
|
|
|
|
### Token embedding
|
|
|
|
First, let's focus on the shape of the data we receive. For each batch, we receive a sequence of T token values, so the input is a matrix of shape (B, T).
|
|
|
|
|
The first layer's role is to encode each token into an array, called a channel, that abstracts both its meaning and its position in the sentence. The layer therefore encodes each token into a channel of size C, producing an output matrix of shape (B, T, C).
|
|
|
|
|
|

|
|
|

|
|
|
|
|
|
This layer will depend on two arrays of parameters:
|
|
|
|
|
|
+ **wte (V,C)** : To each token value, these weights assign a channel of float values. This channel is retrieved at wte\[token_value\]
|
|
|
|
|
|
+ **wpe (maxT,C)** : To each token position in the sequence (the array of size T, where maxT is the largest sequence length the model accepts), these weights assign a channel. It means that for a token at position t in the sequence, an array of size C will be retrieved at wpe\[t\]
|
|
|
|
|
|
**Conclusion** : For one token, the array generated will be wte\[token_value\] + wpe\[t\].
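As an illustration, here is a short C sketch of this encoding step under the shapes described above (a simplified version of what such an encoder_forward can look like):

```c
// Token embedding sketch: for each token, sum its wte row (indexed by token value)
// and its wpe row (indexed by position). out is (B,T,C), inp is (B,T), wte is (V,C).
void encoder_forward(float* out, const int* inp, const float* wte, const float* wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            int token_value = inp[b * T + t];          // token value at position t of batch b
            float* out_bt = out + (b * T + t) * C;     // output channel for this token
            const float* wte_row = wte + token_value * C;
            const float* wpe_row = wpe + t * C;
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_row[i] + wpe_row[i];   // wte[token_value] + wpe[t]
            }
        }
    }
}
```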
|
|
|
|
Let's take Q from one token A (an array of shape C) and K from another token B (also an array of shape C). Their dot product gives a score that measures how much token A should attend to token B.
|
|
|
|
|
Finishing later
|
|
|
|
|
|
#### In the code
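This part is still to be completed. In the meantime, here is a simplified, single-head sketch of how these scores can be computed for one batch element, assuming the (T,3C) input stores Q, K and V contiguously for each token. It is a didactic sketch, not the repository's attention_forward (which also splits the channels into several heads):

```c
#include <math.h>

/* Causal attention scores for one batch element, single head of size C.
 * inp has shape (T, 3C): for each token, [Q(0..C-1) | K(C..2C-1) | V(2C..3C-1)].
 * att has shape (T, T): att[t][t2] = how much token t attends to token t2 (t2 <= t). */
void attention_scores_sketch(float* att, const float* inp, int T, int C) {
    float scale = 1.0f / sqrtf((float)C);
    for (int t = 0; t < T; t++) {
        const float* q = inp + t * 3 * C;              /* Q of token t  */
        float maxval = -1e30f;
        for (int t2 = 0; t2 <= t; t2++) {
            const float* k = inp + t2 * 3 * C + C;     /* K of token t2 */
            float dot = 0.0f;
            for (int i = 0; i < C; i++) dot += q[i] * k[i];
            dot *= scale;                              /* scaled dot product */
            att[t * T + t2] = dot;
            if (dot > maxval) maxval = dot;
        }
        /* softmax over the causal prefix (tokens 0..t) */
        float sum = 0.0f;
        for (int t2 = 0; t2 <= t; t2++) {
            att[t * T + t2] = expf(att[t * T + t2] - maxval);
            sum += att[t * T + t2];
        }
        for (int t2 = 0; t2 <= t; t2++) att[t * T + t2] /= sum;
        for (int t2 = t + 1; t2 < T; t2++) att[t * T + t2] = 0.0f;  /* future tokens masked */
    }
}
```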
|
|
|
|
|
|
### Residual layer
|
|
|
|
|
|
The purpose of this layer is to improve gradient backpropagation. To do so, we allow earlier matrices to be reintroduced into the current ones, by adding the two together (a skip connection).
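In code, this boils down to an element-wise sum; a minimal sketch (here N would be B*T*C):

```c
// Element-wise sum of two activation tensors of N floats.
void residual_forward(float* out, const float* inp1, const float* inp2, int N) {
    for (int i = 0; i < N; i++) {
        out[i] = inp1[i] + inp2[i];  // reinject the earlier activations into the current ones
    }
}
```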
|
|
|
|
|
|
### Gelu layer
|
|
|
|
|
|
|
|
|
This layer simply applies the Gelu function to all elements of the given input (of shape B,T,4C). The Gelu function is drawn just below. 
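GPT-2 uses the tanh approximation of GELU. As a sketch, applying it element-wise over the (B,T,4C) activations looks like this (N = B*T*4C; the constant 0.7978845608 is sqrt(2/pi)):

```c
#include <math.h>

// tanh approximation of GELU, applied element-wise.
void gelu_forward(float* out, const float* inp, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    }
}
```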
|
|
|
|
|
|
### Softmax layer
|
|
|
|
|
|
For this layer, both the input and the output are of shape (B,T,Vp).
|
|
|
|
|
|
This layer is the last one of the model. Its purpose is to output understandable data. By applying a softmax function, we end up with an output of shape (B,T,Vp): to each token, a normalized array of size Vp is assigned. The values of this array describe the probability of each vocabulary entry (a specific sequence of characters) being the next token. Note that any value of this array whose index is greater than or equal to V will be 0, as the vocabulary size is padded for data alignment and not for any logic purpose.
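As a sketch, such a softmax can be written as below, assuming row-major (B,T,Vp) buffers: the maximum logit is subtracted for numerical stability, only the first V entries are normalized, and the padded entries are forced to 0.

```c
#include <math.h>

// probs, logits: (B,T,Vp); only the first V entries of each row are real vocabulary logits.
void softmax_forward(float* probs, const float* logits, int B, int T, int V, int Vp) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* logits_bt = logits + (b * T + t) * Vp;
            float* probs_bt = probs + (b * T + t) * Vp;
            float maxval = -1e30f;                    // subtract the max for numerical stability
            for (int i = 0; i < V; i++) {
                if (logits_bt[i] > maxval) maxval = logits_bt[i];
            }
            float sum = 0.0f;
            for (int i = 0; i < V; i++) {
                probs_bt[i] = expf(logits_bt[i] - maxval);
                sum += probs_bt[i];
            }
            for (int i = 0; i < V; i++) probs_bt[i] /= sum;
            for (int i = V; i < Vp; i++) probs_bt[i] = 0.0f;  // padded entries forced to 0
        }
    }
}
```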
|
|
|
|
|
|
# Model performances
|
|
|
|