In order to calculate our gradients during backpropagation, we have to keep a record of the following activations (a sketch of a possible container struct follows the list):

- **inputs** : Contains all the tokens for the current batch (B, T)
- **encoded** : Output of the positional_encoding layer (B, T, C)
- **ln1** : Output of the first layernorm inside the transformer block (L, B, T, C)
- **ln1_mean** : The mean of each channel for each token in the first layernorm inside the transformer block (L, B, T)
- **ln1_rstd** : The reciprocal standard deviation of each channel for each token in the first layernorm inside the transformer block (L, B, T)
- **qkv** : Output of the first linear layer in the attention layer (L, B, T, 3C)
- **atty** : Output of the attention function (L, B, T, C)
- **preatt** : _Query.Key_ for each head in the attention function (L, B, NH, T, T)
- **att** : Normalized _Query.Key_ for each head in the attention function (L, B, NH, T, T)
- **attproj** : Output of the second linear layer in the attention layer (L, B, T, C)
- **residual2** : Output of the first residual layer in the transformer block (L, B, T, C)
- **ln2** : Output of the second normalization layer in the transformer block (L, B, T, C)
- **ln2_mean** : The mean of each channel for each token in the second normalization layer in the transformer block (L, B, T)
- **ln2_rstd** : The reciprocal standard deviation of each channel for each token in the second normalization layer in the transformer block (L, B, T)
- **fch** : Output of the first linear layer in the transformer block (L, B, T, 4C)
- **fch_gelu** : Output of the GELU layer in the transformer block (L, B, T, 4C)
- **fcproj** : Output of the last linear layer in the transformer block (L, B, T, C)
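
As a rough illustration, these activations could be grouped into a single struct of buffers, as in the sketch below. This is only a minimal sketch: the struct name, field names and exact layout are assumptions mirroring the list above, not necessarily what the code does.

```c
// Hypothetical container for the cached activations listed above (a sketch).
// L = number of transformer blocks, B = batch size, T = sequence length,
// C = channels, NH = number of attention heads.
typedef struct {
    int   *inputs;     // (B, T)          token ids of the current batch
    float *encoded;    // (B, T, C)       output of the positional encoding
    float *ln1;        // (L, B, T, C)    first layernorm output
    float *ln1_mean;   // (L, B, T)
    float *ln1_rstd;   // (L, B, T)
    float *qkv;        // (L, B, T, 3C)   queries, keys and values
    float *atty;       // (L, B, T, C)    attention output
    float *preatt;     // (L, B, NH, T, T)
    float *att;        // (L, B, NH, T, T)
    float *attproj;    // (L, B, T, C)
    float *residual2;  // (L, B, T, C)
    float *ln2;        // (L, B, T, C)
    float *ln2_mean;   // (L, B, T)
    float *ln2_rstd;   // (L, B, T)
    float *fch;        // (L, B, T, 4C)
    float *fch_gelu;   // (L, B, T, 4C)
    float *fcproj;     // (L, B, T, C)
} Activations;
```
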
This operation is always used to compute linear layers. The inputs are:

- A matrix of weights of shape (B, k\*C, OC)
- A matrix of bias of shape (B, OC)

The output will be a matrix of size (B,T,OC). _OC_ stands for _Output Channels_.

Basically, if the inputs are called A, W and B, the output will be the result of _A.W + B_.

**_Missing part for matmul OpenMP parallelization_**

In the code, a tiled matmul has been implemented, using FMAs and OpenMP to improve performance.
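
For reference, a naive (untiled, single-threaded) version of this linear layer could look like the sketch below. It assumes the activations are stored row-major as (B, T, Cin) and the weights as (Cin, OC); the actual code layers tiling, FMAs and an OpenMP parallel loop on top of this idea, and may store the weights transposed.

```c
#include <stddef.h>

// Naive linear layer: out = inp . weight + bias (a sketch, not the tiled version).
// inp:    (B, T, Cin), weight: (Cin, OC), bias: (OC) or NULL, out: (B, T, OC)
void matmul_forward_naive(float *out, const float *inp, const float *weight,
                          const float *bias, int B, int T, int Cin, int OC) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float *x = inp + ((size_t)b * T + t) * Cin;
            float *y = out + ((size_t)b * T + t) * OC;
            for (int o = 0; o < OC; o++) {
                float acc = (bias != NULL) ? bias[o] : 0.0f;
                for (int c = 0; c < Cin; c++) {
                    acc += x[c] * weight[(size_t)c * OC + o];
                }
                y[o] = acc;
            }
        }
    }
}
```
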
|
|
|
|
|
### Attention
|
|
|
For this layer, the input and output are of shape (B,T,Vp).

This layer is the last one of the model. Its purpose is to turn the raw scores into interpretable probabilities. By applying a softmax function, we end up with an output of shape (B,T,Vp): each token is given a normalized array of size Vp, whose values describe the probability of the token being a specific sequence of characters from the vocabulary. Note that any value of this array whose index lies beyond the real vocabulary size V will be 0, as the vocabulary size is only padded for data alignment, not for any logic purpose.
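
A minimal sketch of this step is shown below, assuming the logits and probabilities are stored row-major as (B, T, Vp) and that only the first V entries of each row correspond to real vocabulary tokens; the function name and exact signature are illustrative, not taken from the code.

```c
#include <math.h>
#include <stddef.h>

// Softmax over the vocabulary for every (b, t) position (a sketch).
// logits, probs: (B, T, Vp); only the first V entries of each row are real tokens.
void softmax_forward_sketch(float *probs, const float *logits,
                            int B, int T, int V, int Vp) {
    for (int bt = 0; bt < B * T; bt++) {
        const float *in = logits + (size_t)bt * Vp;
        float *out = probs + (size_t)bt * Vp;
        // subtract the row maximum for numerical stability
        float maxval = in[0];
        for (int i = 1; i < V; i++) {
            if (in[i] > maxval) maxval = in[i];
        }
        float sum = 0.0f;
        for (int i = 0; i < V; i++) {
            out[i] = expf(in[i] - maxval);
            sum += out[i];
        }
        for (int i = 0; i < V; i++) {
            out[i] /= sum;
        }
        // the padded part of the vocabulary carries no probability mass
        for (int i = V; i < Vp; i++) {
            out[i] = 0.0f;
        }
    }
}
```
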
|
|
|
|
|
## Backward propagation
|
|
|
|
|
For backward propagation, we are using the _cross-entropy_ error function.

![63d2878f8b9a1378924ba771_Formula-9](uploads/7e9876747d85bdb9b53acba639a7186b/63d2878f8b9a1378924ba771_Formula-9.png)

Therefore, we are comparing the vector of probabilities given by the softmax layer with the vector of known true probabilities (a vector of zeros and a single 1). Thus, by taking the derivative of the cross-entropy function, we can get our gradients and update the model parameters. For more information about how backpropagation works, see https://en.wikipedia.org/wiki/Backpropagation .
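
Concretely, when the softmax and the cross-entropy loss are differentiated together, the gradient with respect to the logits takes a very simple form: the predicted probability minus the one-hot target. The sketch below illustrates this, assuming the same (B, T, Vp) layout, one target token id per position, and a loss averaged over all B\*T positions; names and signature are illustrative.

```c
#include <stddef.h>

// Gradient of softmax + cross-entropy with respect to the logits (a sketch):
// dlogits[i] += (probs[i] - 1{i == target}) / (B*T)
// probs, dlogits: (B, T, Vp); targets: (B, T) token ids in [0, V).
void crossentropy_softmax_backward_sketch(float *dlogits, const float *probs,
                                          const int *targets,
                                          int B, int T, int V, int Vp) {
    float scale = 1.0f / (float)(B * T); // the loss is averaged over all positions
    for (int bt = 0; bt < B * T; bt++) {
        const float *p = probs + (size_t)bt * Vp;
        float *d = dlogits + (size_t)bt * Vp;
        int target = targets[bt];
        for (int i = 0; i < V; i++) {
            float indicator = (i == target) ? 1.0f : 0.0f;
            d[i] += (p[i] - indicator) * scale;
        }
        // gradients for the padded vocabulary entries are left at zero
    }
}
```
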
|
|
|
|
|
The optimizer used to update the model is [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html), with the following parameters (a sketch of the update step is given after the list):

* learning rate : 0.0001 (1e-4)
* beta1 : 0.9
* beta2 : 0.999
* epsilon : 1e-8
* weight decay : 0
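
Put together, a single AdamW step with these parameters could be sketched as below: a simplified version over one flat parameter array, where `m` and `v` are the first and second moment buffers kept between steps and `t` is the 1-based step number (all names are illustrative, not taken from the code).

```c
#include <math.h>

// One AdamW update over all parameters, with the hyperparameters listed above (a sketch).
void adamw_update_sketch(float *params, const float *grads, float *m, float *v,
                         long n, int t) {
    const float lr = 1e-4f, beta1 = 0.9f, beta2 = 0.999f;
    const float eps = 1e-8f, weight_decay = 0.0f;
    for (long i = 0; i < n; i++) {
        // biased first and second moment estimates
        m[i] = beta1 * m[i] + (1.0f - beta1) * grads[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * grads[i] * grads[i];
        // bias correction
        float m_hat = m[i] / (1.0f - powf(beta1, (float)t));
        float v_hat = v[i] / (1.0f - powf(beta2, (float)t));
        // decoupled weight decay (zero here, kept for completeness)
        params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
    }
}
```
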
|
|
|
|
|
# Model performance
|
|
|
|
|
## Sequential