- **B** : Batch size, i.e. the number of token sequences per batch (4)

- **NH** : Number of heads in the attention layer (12)

### Model parameters

### Model global data structure _GPT2_

* **GPT2Config config** : Contains all the parameters described just above

* **ParameterTensors params** : Contains the parameters described in the "Model forward parameters" section

* **size_t param_sizes\[NUM_PARAMETER_TENSORS\]** : The size of each parameter array inside "params"

* **float\* params_memory** : A pointer to the first parameter array of "params"

* **size_t num_parameters** : The total number of parameters in our model (124,439,808 in our case)

* **ParameterTensors grads** : Contains all the gradient arrays for backpropagation

* **float\* grads_memory** : A pointer to the first array of "grads"

* **float\* m_memory** : Used by the AdamW optimizer (first moment estimates). It is an array of size _num_parameters_

* **float\* v_memory** : Same as _m_memory_ (second moment estimates)

* **ActivationTensors acts** : Contains the activation tensors described in the "Variables for backward propagation" section

* **size_t act_sizes\[NUM_ACTIVATION_TENSORS\]** : The size of each activation array inside "acts"

* **float\* acts_memory** : A pointer to the first activation array of "acts"

* **size_t num_activations** : The total number of activation values kept for the backward pass (the sum of _act_sizes_)

* **ActivationTensors grads_acts** : Contains the activation gradients

* **float\* grads_acts_memory** : A pointer to the first array of "grads_acts"

* **int batch_size** : The size of the current batch

* **int seq_len** : The length of the current token sequence

* **int\* inputs** : The token array for the current batch

* **int\* targets** : The target array for the current batch

* **float mean_loss** : Mean loss of the current batch
### Model forward parameters

- **wte** : Weights for token embedding, shape (V, C)

- **wpe** : Weights for positional embedding, shape (maxT, C)
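
As a quick illustration of how wte and wpe are used together, here is a minimal sketch (the function name is ours) that embeds a single sequence of T tokens: the wte row of each token is added to the wpe row of its position. The forward pass in the code presumably does the same addition over the whole (B, T) batch at once.

```c
#include <stddef.h>

// Sketch: embed one sequence of T tokens.
// wte has shape (V, C), wpe has shape (maxT, C), out has shape (T, C).
// Assumes every tokens[t] is a valid id in [0, V) and T <= maxT.
void embed_sequence(float* out, const int* tokens,
                    const float* wte, const float* wpe,
                    int T, int C) {
    for (int t = 0; t < T; t++) {
        const float* token_row = wte + (size_t)tokens[t] * C; // row of wte
        const float* pos_row   = wpe + (size_t)t * C;         // row of wpe
        float* out_row         = out + (size_t)t * C;
        for (int c = 0; c < C; c++) {
            out_row[c] = token_row[c] + pos_row[c]; // token + positional embedding
        }
    }
}
```
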
### Variables for backward propagation

In order to compute the gradients for backpropagation, we have to keep a history of all the inputs and outputs of each layer. In addition, for layernorm and attention, we also need to keep a history of some vectors that are computed inside these layers.

All of these vectors are stored in the field _acts_ of the GPT2 model's data structure.
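
Given the bookkeeping fields described earlier (_act_sizes_, _acts_memory_, _num_activations_), a natural way to store this history is to record the size of every activation tensor, allocate one flat buffer large enough for all of them, and point each tensor at its slice of that buffer. The sketch below illustrates that pattern; the function name and the way the tensor pointers are collected are illustrative, not the exact code.

```c
#include <stdlib.h>

// Illustrative only: allocate one flat buffer and carve it into tensors.
// sizes[i] is the element count of tensor i; tensors[i] is the address of
// the float* field that should point into the buffer.
float* alloc_and_point(float** tensors[], const size_t* sizes, int num_tensors,
                       size_t* total_out) {
    size_t total = 0;
    for (int i = 0; i < num_tensors; i++) total += sizes[i];

    float* memory = (float*)malloc(total * sizeof(float));
    if (memory == NULL) return NULL;

    float* cursor = memory;
    for (int i = 0; i < num_tensors; i++) {
        *tensors[i] = cursor;   // point this tensor at its slice of the buffer
        cursor += sizes[i];     // advance by the tensor's element count
    }
    *total_out = total;         // e.g. stored in num_activations
    return memory;              // e.g. stored in acts_memory
}
```

The same pattern applies to the parameters (_param_sizes_, _params_memory_, _num_parameters_) and to the two gradient buffers.
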
_Reminder: a channel is an array of size C (768) which represents a token._

These layers always receive and return matrices of shape (B, T, C).

First, this layer performs a basic normalization over each channel. Then, it scales and shifts each normalized value using the ln1w weights and ln1b biases (both of shape C): given a normalized channel, each value is multiplied by the corresponding weight from ln1w, and the corresponding bias from ln1b is added.

This layer also keeps a history of the mean and reciprocal standard deviation of each channel for the backward pass.
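
Here is a sketch of that computation for a single channel, assuming the standard layernorm formulation with a small epsilon added to the variance for numerical stability. It also saves the mean and the reciprocal standard deviation, since they are reused in the backward pass.

```c
#include <math.h>

// Normalize one channel x of size C, then scale by weight and shift by bias.
// mean_out and rstd_out cache the statistics needed by the backward pass.
void layernorm_channel(float* out, const float* x,
                       const float* weight, const float* bias, int C,
                       float* mean_out, float* rstd_out) {
    const float eps = 1e-5f;

    // mean of the channel
    float mean = 0.0f;
    for (int c = 0; c < C; c++) mean += x[c];
    mean /= C;

    // variance of the channel, then reciprocal standard deviation
    float var = 0.0f;
    for (int c = 0; c < C; c++) {
        float d = x[c] - mean;
        var += d * d;
    }
    var /= C;
    float rstd = 1.0f / sqrtf(var + eps);

    // normalize, then scale and shift with the learned parameters
    for (int c = 0; c < C; c++) {
        float n = (x[c] - mean) * rstd;
        out[c] = n * weight[c] + bias[c];
    }

    *mean_out = mean; // kept for the backward pass
    *rstd_out = rstd; // kept for the backward pass
}
```

For the first layernorm of a block, `weight` and `bias` would be ln1w and ln1b.
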
To explain what these three arrays (Query, Key and Value) are, let's take a web search as an example. In this analogy, a query vector would be what you typed into the search bar, a key vector would be the labels and metadata of one website, and a value vector would be the link to that website. Computing the dot product between your query and the key vectors of all the websites, and then normalizing the result, gives, for each website, the probability that it matches your request (this matrix is called the attention matrix). Multiplying these probabilities by the value vectors of the corresponding websites then gives you the links, weighted by relevance.

To sum up, you have: \
1\. normalize(Query.Key) = Matrix of attention\
2\. Matrix of attention.Values = Output of attention layer
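
The sketch below spells out those two steps for a single head and a single query position, assuming the usual scaled dot-product formulation (the scores are divided by the square root of the head size before the softmax) and GPT-2's causal masking, where a token only attends to itself and to earlier positions. The names and the fixed-size buffer are simplifications; the real attention code also loops over B, T and the NH heads.

```c
#include <math.h>

// One attention step for a single head and a single query position t.
// q points to the query vector (head_size floats); keys and values each hold
// T contiguous vectors of head_size floats. out receives head_size floats.
void attention_one_query(float* out, const float* q,
                         const float* keys, const float* values,
                         int t, int head_size) {
    float scale = 1.0f / sqrtf((float)head_size);

    // 1. attention scores: scaled dot products with the keys, then a softmax
    //    over positions 0..t (causal mask: later positions are ignored)
    float att[1024];                      // sketch assumption: t < 1024
    float maxval = -1e30f;
    for (int t2 = 0; t2 <= t; t2++) {
        float score = 0.0f;
        for (int i = 0; i < head_size; i++) score += q[i] * keys[t2 * head_size + i];
        score *= scale;
        att[t2] = score;
        if (score > maxval) maxval = score;
    }
    float sum = 0.0f;
    for (int t2 = 0; t2 <= t; t2++) {
        att[t2] = expf(att[t2] - maxval); // subtract the max for numerical stability
        sum += att[t2];
    }
    for (int t2 = 0; t2 <= t; t2++) att[t2] /= sum; // one row of the attention matrix

    // 2. output: attention-weighted sum of the value vectors
    for (int i = 0; i < head_size; i++) out[i] = 0.0f;
    for (int t2 = 0; t2 <= t; t2++) {
        for (int i = 0; i < head_size; i++) {
            out[i] += att[t2] * values[t2 * head_size + i];
        }
    }
}
```
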
In our case, this allows the model to understand the relations between the words of a sentence. Let's apply attention to the sentence "I like this house". Below is the attention matrix that we could obtain for this sentence.

:warning: The attention matrix is the matrix obtained just before the dot product with the value vectors.

The Query, Key and Value vectors are obtained by the operation: _token-array . qkvw-parameters-array + qkvb-parameters-array_

Be aware that they have already been computed by the time _attention_forward_ is entered. In fact, the two linear layers shown in the GPT-2 model are computed outside the _attention_forward_ function, using the _matmul_forward_ function.
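
To make that last point concrete, here is a sketch of the projection that produces the Query, Key and Value vectors for one token: the token's channel is multiplied by the qkvw weights and the qkvb bias is added, giving a vector of size 3C (Q, K and V stacked) that the attention function later splits per head. The function name and the row-major weight layout are assumptions of this sketch; in the code, _matmul_forward_ performs this computation for the whole (B, T) batch.

```c
#include <stddef.h>

// Sketch: compute the fused QKV vector for a single token channel.
// x has C floats, qkvw has (3*C) x C floats (row-major, one row per output
// value), qkvb has 3*C floats, out receives 3*C floats laid out as [Q | K | V].
void qkv_for_token(float* out, const float* x,
                   const float* qkvw, const float* qkvb, int C) {
    int OC = 3 * C;                       // output channels: Q, K and V stacked
    for (int o = 0; o < OC; o++) {
        float val = qkvb[o];              // start from the bias
        const float* w_row = qkvw + (size_t)o * C;
        for (int i = 0; i < C; i++) {
            val += x[i] * w_row[i];       // dot product with one weight row
        }
        out[o] = val;
    }
}
```
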