|
|
|
|
|
# Code description
|
|
|
|
|
|
This part is dedicated to facilitating the comprehension of the code given in the [GitHub repository](https://github.com/karpathy/llm.c).
|
|
|
|
|
|
## Dictionary
|
|
|
|
|
|
- **lnfb** : Bias for the last normalization layer (C)
|
|
|
- **wte** : Weights for the last linear layer (V, C)
|
|
|
|
|
|
**_NB : for all the parameters whose shape's first dimension is L, keep in mind that this additional dimension of size L is only there because these parameters are used in the transformer layers, which are repeated L times. In other words, this dimension exists purely for storage. In the next parts, when talking about these parameters, I will not consider this dimension, as it is not useful to do so. This is also why, for every transformer layer, the position of each parameter matrix is recalculated._**
|
|
|
|
|
|
|
|
|
## Functions description
|
|
|
|
|
|
Here is the order in which the functions are called in the model. Each entry is written as **function_name** _input_shape -> output_shape_ // _parameters used_. You will find their description just below, and a rough code sketch of this call sequence right after the list:
|
|
|
1. **encoder_forward** _(B,T) -> (B,T,C)_ // _wte wpe_
|
|
|
2. transformer layer, repeated L times, composed of:
|
|
|
1. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln1w ln1b_
|
|
|
2. **matmul_forward** _(B,T,C) -> (B,T,3C)_ // _qkvw qkvb_
|
|
|
3. **attention_forward** _(B,T,3C) -> (B,T,C)_
|
|
|
4. **matmul_forward** _(B,T,C) -> (B,T,C)_ // _attprojw attprojb_
|
|
|
5. **residual_forward** _(B,T,C) + (B,T,C) -> (B,T,C)_
|
|
|
6. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln2w ln2b_
|
|
|
7. **matmul_forward** _(B,T,C) -> (B,T,4C)_ // _fcw fcb_
|
|
|
8. **gelu_forward** _(B,T,4C) -> (B,T,4C)_
|
|
|
9. **matmul_forward** _(B,T,4C) -> (B,T,C)_ // _fcprojw fcprojb_
|
|
|
10. **residual_forward** _(B,T,C) + (B,T,C) -> (B,T,C)_
|
|
|
3. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _lnfw lnfb_
|
|
|
4. **matmul_forward** _(B,T,C) -> (B,T,Vp)_ // _wte_
|
|
|
5. **softmax_forward** _(B,T,Vp) -> (B,T,Vp)_
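
To make this call sequence more concrete, here is a rough sketch of the forward pass. The buffer names and the simplified signatures are illustrative placeholders (the real functions in the repository take additional buffers, for example to cache values needed by the backward pass), but the order of the calls is the one listed above:

```c
// Illustrative outline of the forward pass. Buffer names (x, qkv, atty, ...)
// and the simplified signatures are placeholders, not the exact llm.c identifiers.
// Per-layer parameters such as ln1w are really indexed by the layer l; the L
// dimension is ignored here, as explained in the NB above.
encoder_forward(x, tokens, wte, wpe, B, T, C);                   // (B,T) -> (B,T,C)
for (int l = 0; l < L; l++) {                                    // L transformer layers
    layernorm_forward(ln1, x, ln1w, ln1b, B, T, C);
    matmul_forward(qkv, ln1, qkvw, qkvb, B, T, C, 3 * C);        // (B,T,C) -> (B,T,3C)
    attention_forward(atty, qkv, B, T, C, NH);                   // (B,T,3C) -> (B,T,C)
    matmul_forward(attproj, atty, attprojw, attprojb, B, T, C, C);
    residual_forward(x2, x, attproj, B * T * C);                 // first residual
    layernorm_forward(ln2, x2, ln2w, ln2b, B, T, C);
    matmul_forward(fch, ln2, fcw, fcb, B, T, C, 4 * C);          // (B,T,C) -> (B,T,4C)
    gelu_forward(fch_gelu, fch, B * T * 4 * C);
    matmul_forward(fcproj, fch_gelu, fcprojw, fcprojb, B, T, 4 * C, C);
    residual_forward(x, x2, fcproj, B * T * C);                  // second residual
}
layernorm_forward(lnf, x, lnfw, lnfb, B, T, C);                  // last normalization layer
matmul_forward(logits, lnf, wte, NULL, B, T, C, Vp);             // last linear layer reuses wte
softmax_forward(probs, logits, B, T, V, Vp);                     // probabilities over the vocabulary
```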
|
|
|
|
|
|
### Token embedding
|
|
|
|
|
|
First, let's focus on the shape of the data we receive. For each batch, we receive B (4) sequences of T (64) tokens. Therefore, we receive a matrix of shape (B, T) containing tokens. These tokens are integers whose values range between 0 and V (50 257).
|
|
|
|
|
|
The first layer's role is to encode each token into an array, called a channel, that abstracts its meaning and its position in the sentence. Therefore, the layer encodes each token into a channel of size C, producing an output matrix of shape (B, T, C).
|
|
|
|
|
|
![Token_embedding_1_](uploads/ffb83174e0e50930f6533b52be569858/Token_embedding_1_.png)
|
|
|
|
|
|
|
|
|
This layer depends on two arrays of parameters:
|
|
|
|
|
|
+ **wte (V, C)** : To each token value, these weights assign a channel of float values. This channel is retrieved at wte\[token_value\]
|
|
|
+ **wpe** : To each token position in the sequence (the array of size T), these weights assign a channel. It means that, for a token at position t in the sequence, an array of size C will be retrieved at wpe\[t\]
|
|
|
|
|
|
|
|
|
**Conclusion** : For one token, the array generated will be wte\[token_value\] + wpe\[t\]
|
|
|
|
|
|
|
|
|
_Reminder : a channel is an array of size C (768) which represents a token_
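
A minimal sketch of this computation in C, close in spirit to encoder_forward in the repository (the exact signature may differ slightly):

```c
// For every token: its output channel is wte[token_value] + wpe[t].
// out is (B,T,C), inp is the (B,T) matrix of token values, wte is (V,C),
// wpe assigns one channel of size C per position t.
void encoder_forward(float* out, const int* inp, const float* wte, const float* wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;   // channel of token (b, t)
            int ix = inp[b * T + t];                 // token value, in [0, V)
            const float* wte_ix = wte + ix * C;      // channel assigned to this token value
            const float* wpe_t = wpe + t * C;        // channel assigned to this position
            for (int i = 0; i < C; i++) {
                out_bt[i] = wte_ix[i] + wpe_t[i];
            }
        }
    }
}
```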
|
|
|
|
|
|
### Normalization
|
|
|
|
|
|
These layers always receive and give back matrices of shape (B,T,C).
|
|
|
|
|
|
First, this layer performs a basic normalization over each channel. Then, it scales each normalized value using the ln1w weights and ln1b biases (both of shape (C)): each value of a normalized channel is multiplied by its corresponding weight from ln1w, then its corresponding bias from ln1b is added.
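
A simplified sketch of this layer in C (the actual layernorm_forward also fills extra buffers used later by the backward pass, which are omitted here):

```c
#include <math.h>

// Normalize each channel (mean 0, variance 1), then scale by weight and shift by bias.
// inp and out are (B,T,C); weight and bias (e.g. ln1w, ln1b) are of shape (C).
void layernorm_forward(float* out, const float* inp, const float* weight, const float* bias,
                       int B, int T, int C) {
    const float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;
            float* o = out + (b * T + t) * C;
            float mean = 0.0f;                              // mean of the channel
            for (int i = 0; i < C; i++) { mean += x[i]; }
            mean /= C;
            float var = 0.0f;                               // variance of the channel
            for (int i = 0; i < C; i++) { float d = x[i] - mean; var += d * d; }
            var /= C;
            float rstd = 1.0f / sqrtf(var + eps);           // reciprocal standard deviation
            for (int i = 0; i < C; i++) {
                float n = (x[i] - mean) * rstd;             // normalized value
                o[i] = n * weight[i] + bias[i];             // scale and shift
            }
        }
    }
}
```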
|
|
|
|
|
|
|
|
|
### Matmul
|
|
|
|
|
|
This operation is always used to compute linear layers. The inputs are:
|
|
|
|
|
|
|
|
|
- A matrix of shape (B,T,k\*C), where k is an integer
|
|
|
- A matrix of weights of shape (k\*C, OC), shared across the batch
|
|
|
- A vector of biases of shape (OC), also shared across the batch
|
|
|
|
|
|
The output will be a matrix of shape (B,T,OC).
|
|
|
|
|
|
|
|
|
Basically, if the inputs are called A, W and b (the bias), the output will be the result of _A.W + b_.
|
|
|
|
|
|
**_Missing part for matmul openMP parallelization_**
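
As a rough illustration of what that parallelization can look like (not the actual llm.c kernel, which is organized differently and more heavily optimized): the (b, t) output rows are independent, so the two outer loops can be shared among threads with a single OpenMP pragma. The naive sketch below follows the (k\*C, OC) weight layout described above:

```c
// Naive sketch of the linear layer: out = inp . W + bias.
// inp is (B,T,KC), weight is (KC,OC), bias is (OC), out is (B,T,OC).
void matmul_forward_naive(float* out, const float* inp, const float* weight, const float* bias,
                          int B, int T, int KC, int OC) {
    // every (b, t) output row is independent, so the two outer loops
    // can be distributed among threads (compile with -fopenmp)
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * KC;
            float* o = out + (b * T + t) * OC;
            for (int oc = 0; oc < OC; oc++) {
                float val = (bias != NULL) ? bias[oc] : 0.0f;
                for (int i = 0; i < KC; i++) {
                    val += x[i] * weight[i * OC + oc];  // column oc of W
                }
                o[oc] = val;
            }
        }
    }
}
```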
|
|
|
|
|
|
### Attention
|
|
|
|
|
|
#### What is attention
|
|
|
|
|
|
In a sentence, some words, when taken alone, have no meaning. For example, in the sentence "I like this house, it looks great!", "it" does not mean anything on its own, and "great" refers to the "house", not to "I". Therefore, it is necessary to introduce a value that tells the model which words are related to each other. In order to do so, for each token, we introduce 3 values:
|
|
|
|
|
|
- **Queries** _Q_ of shape (B,T,C)
|
|
|
- **Keys** _K_ of shape (B,T,C)
|
|
|
- **Values** _V_ of shape (B,T,C)
|
|
|
|
|
|
Let's take Q from one token A (an array of size C) and K from another token B (also of size C). The dot product _Q.K_ gives a number that represents how much token B is related to token A. Computing this dot product for every pair (Q, K) builds a table stating how strongly each token is related to every other token. These scores are then normalized with a softmax (with a causal mask, so that a token can only attend to the tokens that come before it). Finally, each token's output channel is computed as the weighted sum of the values _V_, using these normalized scores as weights.
|
|
|
|
|
|
#### In the code
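
The actual attention_forward works on the fused (B,T,3C) QKV tensor produced by the previous matmul and splits the C channels into several attention heads. As a simplified, single-head sketch of the mechanism described above (separate Q, K, V arrays, illustrative only, not the llm.c implementation):

```c
#include <math.h>

// Single-head causal attention sketch. q, k, v and out are all (B,T,C).
void attention_single_head(float* out, const float* q, const float* k, const float* v,
                           int B, int T, int C) {
    float scale = 1.0f / sqrtf((float)C);
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* qt = q + (b * T + t) * C;   // query of token t
            float* ot = out + (b * T + t) * C;
            float scores[1024];                      // assumes T <= 1024
            // 1) relation of token t to every previous token (causal: t2 <= t)
            float maxval = -1e30f;
            for (int t2 = 0; t2 <= t; t2++) {
                const float* kt2 = k + (b * T + t2) * C;
                float s = 0.0f;
                for (int i = 0; i < C; i++) { s += qt[i] * kt2[i]; }
                s *= scale;
                scores[t2] = s;
                if (s > maxval) maxval = s;
            }
            // 2) softmax normalization of the scores
            float sum = 0.0f;
            for (int t2 = 0; t2 <= t; t2++) { scores[t2] = expf(scores[t2] - maxval); sum += scores[t2]; }
            for (int t2 = 0; t2 <= t; t2++) { scores[t2] /= sum; }
            // 3) output channel = weighted sum of the values
            for (int i = 0; i < C; i++) { ot[i] = 0.0f; }
            for (int t2 = 0; t2 <= t; t2++) {
                const float* vt2 = v + (b * T + t2) * C;
                for (int i = 0; i < C; i++) { ot[i] += scores[t2] * vt2[i]; }
            }
        }
    }
}
```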
|
|
|
|
|
|
### Residual layer
|
|
|
|
|
|
The purpose of this layer is to improve gradient backpropagation. To do so, we allow older data matrices to be reintroduced into the current matrices by adding the two together.
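
The operation itself is just an element-wise sum; a sketch close to residual_forward:

```c
// Element-wise sum of the layer input (inp1) and the layer output (inp2).
// N is the total number of elements, here B * T * C.
void residual_forward(float* out, const float* inp1, const float* inp2, int N) {
    for (int i = 0; i < N; i++) {
        out[i] = inp1[i] + inp2[i];
    }
}
```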
|
|
|
|
|
|
### Gelu layer
|
|
|
|
|
|
This layer simply applies the Gelu function to all elements of the given input (of shape B,T,4C). The Gelu function is drawn just below. ![gelu](uploads/0c248ab94914bfc223093046744b59a8/gelu.png)
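
A sketch of this layer, using the tanh approximation of GELU that is standard for GPT-2:

```c
#include <math.h>

// Apply GELU element-wise. N is the total number of elements, here B * T * 4C.
void gelu_forward(float* out, const float* inp, int N) {
    const float s = 0.7978845608f; // sqrt(2 / pi)
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(s * (x + cube)));
    }
}
```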
|
|
|
|
|
|
### Softmax layer
|
|
|
|
|
|
For this layer, both the input and the output are of shape (B,T,Vp).
|
|
|
|
|
|
This layer is the last one of the model. Its purpose is to output understandable data. By applying a softmax function, we end up with an output of shape (B,T,Vp): to each token position, a normalized array of size Vp is assigned, whose values describe the probability of each entry of the vocabulary being the next token. Note that any value of this array whose index is greater than or equal to V will be 0, as the vocabulary size is only padded to Vp for data alignment and not for any logic purpose.
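
A sketch of this layer in C: the softmax is computed over the first V entries only (subtracting the maximum for numerical stability), and the padded entries from V to Vp are set to 0:

```c
#include <math.h>

// logits and probs are (B,T,Vp). Only the first V entries of each row are real
// vocabulary entries; the padded tail exists only for data alignment.
void softmax_forward(float* probs, const float* logits, int B, int T, int V, int Vp) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = logits + (b * T + t) * Vp;
            float* p = probs + (b * T + t) * Vp;
            float maxval = -1e30f;                        // for numerical stability
            for (int i = 0; i < V; i++) { if (x[i] > maxval) maxval = x[i]; }
            float sum = 0.0f;
            for (int i = 0; i < V; i++) { p[i] = expf(x[i] - maxval); sum += p[i]; }
            for (int i = 0; i < V; i++) { p[i] /= sum; }  // normalized probabilities
            for (int i = V; i < Vp; i++) { p[i] = 0.0f; } // padded entries carry no probability
        }
    }
}
```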
|
|
|
|
|
|
## Model's data
|
|
|
|