|
|
- **Keys** _K_ of shape (B,T,C)
|
|
|
- **Values** _V_ of shape (B,T,C)
|
|
|
|
|
|
Let's take Q from one token A (an array of size C) and K from another token B (also of size C). The dot product _Q·K_ gives a single number that represents how much token B is related to token A. Computing this dot product for every pair (Q, K) builds a table stating how each token relates to every other one. These raw values are then normalized (in practice with a softmax) so that each row sums to 1. Finally, the normalized values are multiplied by the value vectors _V_, which gives the output of the attention mechanism.
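To make this concrete, here is a minimal standalone C sketch with toy values (not the llm.c code) that computes the attention row of a single token: the dot product of its query with every key, followed by the normalization; the 1/sqrt(C) scaling factor is part of standard attention and is folded into the normalization step.

```c
// Minimal sketch (toy values, not the llm.c code): the raw attention score
// between one query and one key is their dot product; the T scores of a query
// are then normalized so that they sum to 1.
#include <math.h>
#include <stdio.h>

#define C 4   // channel size (illustrative)
#define T 3   // sequence length (illustrative)

// dot product Q.K between two vectors of size C, scaled by 1/sqrt(C)
float score(const float* q, const float* k) {
    float s = 0.0f;
    for (int i = 0; i < C; i++) { s += q[i] * k[i]; }
    return s / sqrtf((float)C);
}

// softmax over the T raw scores of one token: exponentiate and divide by the sum
void normalize(float* scores) {
    float maxv = scores[0];
    for (int t = 1; t < T; t++) { if (scores[t] > maxv) maxv = scores[t]; }
    float sum = 0.0f;
    for (int t = 0; t < T; t++) { scores[t] = expf(scores[t] - maxv); sum += scores[t]; }
    for (int t = 0; t < T; t++) { scores[t] /= sum; }
}

int main(void) {
    float Q[T][C] = {{1,0,0,0},{0,1,0,0},{0,0,1,0}};  // toy queries
    float K[T][C] = {{1,0,0,0},{1,1,0,0},{0,0,1,1}};  // toy keys
    float att[T];                                     // attention row of token 0
    for (int t = 0; t < T; t++) { att[t] = score(Q[0], K[t]); }
    normalize(att);
    for (int t = 0; t < T; t++) { printf("att[0][%d] = %.3f\n", t, att[t]); }
    return 0;
}
```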
|
|
|
To explain what these three arrays are, let's take a web search as an example. In this analogy, the query vector is what you typed in your search bar, the key vector of a website is its labels and metadata, and the value vector of a website is its link. Taking the dot product between your query and the key vectors of all the websites, and then normalizing the results, gives for each website the probability that it matches your request. Multiplying these probabilities by the value vectors of the corresponding websites then gives you the links, weighted by relevance. In our case, this mechanism allows the model to compute the relation between each pair of words in a sentence. Let's apply attention to the sentence "I like this house". Below is the attention matrix that we could obtain for this sentence. Note that the attention matrix is the matrix obtained just before the multiplication with the value vectors.
|
|
|
|
|
|
(The values below are hand-picked for illustration, not computed by a real model; each row sums to 1.)

| | I | like | this | house |
|---|---|---|---|---|
| **I** | 0.55 | 0.20 | 0.10 | 0.15 |
| **like** | 0.15 | 0.40 | 0.15 | 0.30 |
| **this** | 0.10 | 0.15 | 0.35 | 0.40 |
| **house** | 0.10 | 0.20 | 0.30 | 0.40 |

Reading one row gives how much that token attends to every token of the sentence.
|
|
|
The Query, Key and Value vectors are obtained by the operation: _token-array · qkvw-parameters-array + qkvb-parameters-array_
|
|
|
Be aware that they have already been computed when entering _attention_forward_. In reality, the two linear layers shown in the GPT-2 model are computed outside the _attention_forward_ function, using the _matmul_forward_ function.
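To illustrate the first of these two linear layers, here is a standalone sketch with toy sizes (not the actual _matmul_forward_ code): every token of size C is projected to 3\*C numbers, which are its Query, Key and Value stored back to back.

```c
// Illustrative sketch of the QKV projection (toy sizes, not matmul_forward):
// a single linear layer maps each token (size C) to 3*C numbers, which are its
// Query, Key and Value stored back to back.
#include <stdio.h>

#define B 1   // batch size (toy)
#define T 2   // sequence length (toy)
#define C 3   // channels (toy)

int main(void) {
    float inp[B][T][C] = {{{1,0,2},{0,1,1}}};   // token activations entering the layer
    float qkvw[3*C][C] = {{0}};                 // projection weights (toy values below)
    float qkvb[3*C]    = {0};                   // projection bias
    float qkv[B][T][3*C];                       // output: Q, K, V concatenated per token
    for (int o = 0; o < 3*C; o++) { qkvw[o][o % C] = 1.0f; }  // toy weights

    for (int b = 0; b < B; b++)
        for (int t = 0; t < T; t++)
            for (int o = 0; o < 3*C; o++) {
                float val = qkvb[o];
                for (int i = 0; i < C; i++) { val += inp[b][t][i] * qkvw[o][i]; }
                qkv[b][t][o] = val;   // o in [0,C) -> Q, [C,2C) -> K, [2C,3C) -> V
            }

    printf("Q of token 0: %.1f %.1f %.1f\n", qkv[0][0][0], qkv[0][0][1], qkv[0][0][2]);
    return 0;
}
```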
|
|
|
|
|
|
![Multi-head_attention_4\_](uploads/694ffd3c515d9e116dd856bff354522b/Multi-head_attention_4\_.png)
|
|
|
|
|
|
![Multi-head_attention_1\_](uploads/8157f8511287942577ff8e3fc2fdd272/Multi-head_attention_1\_.png)
|
|
|
|
|
|
### Masked attention
|
|
|
|
|
|
Masked attention is very simple. During the training phase, as you can see above, the model has access to words _that have not been produced yet_. This means that it can compute the attention of "like" with "house" even though "house" should be absent from the token sequence at the time we are computing the attention for "like". Therefore, we need **masked attention**, which consists of setting the attention of future words to zero, as can be seen just below.
|
|
|
|
|
|
![Multi-head_attention](uploads/a7165d88c6376c68e4584e26f0a248e6/Multi-head_attention.png)
|
|
|
|
|
|
We also re-normalize each row so that it sums to 1 again.
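Here is a minimal sketch of these two steps, starting from an illustrative, already-normalized (T, T) attention matrix: entries that look at future tokens are zeroed, then each row is re-normalized (in practice the same effect is obtained by setting the future scores to minus infinity before the normalization).

```c
// Minimal sketch of masked attention as described above: take a (T, T)
// attention matrix, zero every entry that looks at a future token, then
// re-normalize each row so it sums to 1 again.
#include <stdio.h>

#define T 4  // "I like this house"

int main(void) {
    // illustrative attention matrix, each row already sums to 1
    float att[T][T] = {
        {0.40f, 0.20f, 0.20f, 0.20f},
        {0.10f, 0.50f, 0.20f, 0.20f},
        {0.10f, 0.20f, 0.50f, 0.20f},
        {0.10f, 0.10f, 0.20f, 0.60f},
    };
    for (int i = 0; i < T; i++) {
        float sum = 0.0f;
        for (int j = 0; j < T; j++) {
            if (j > i) { att[i][j] = 0.0f; }  // mask tokens in the future
            sum += att[i][j];
        }
        for (int j = 0; j < T; j++) { att[i][j] /= sum; }  // re-normalize the row
    }
    for (int i = 0; i < T; i++) {
        for (int j = 0; j < T; j++) { printf("%.2f ", att[i][j]); }
        printf("\n");
    }
    return 0;
}
```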
|
|
|
|
|
|
### Multi-head masked attention
|
|
|
|
|
|
Remember that each token is represented by an array of size C. Therefore, for each pair of words we don't get a single attention value, but a vector of attention values.
|
|
|
|
|
|
![Multi-head_attention_3\_](uploads/11d50badde98a23300b0ddf5aa482eb1/Multi-head_attention_3\_.png)
|
|
|
|
|
|
The question is: what is the shape of this 3D matrix? To answer this, we need to go a bit deeper into the details of multi-head attention.
|
|
|
|
|
|
![Multi-head_attention_5\_](uploads/21f3f30d423c8ed7a7bea51f29f9454a/Multi-head_attention_5\_.png)
|
|
|
|
|
|
First, we need to introduce some vocabulary:
|
|
|
|
|
|
* **NH:** the Number of Heads; NH must be a divisor of C
|
|
|
* **hs:** the head size, equal to C / NH
|
|
|
|
|
|
The principle of multi-head attention is to divide each Q, K and V into _heads_. As you can see above, Q, K and V are divided into NH parts, thus making sub-arrays of size _hs_. Then, we apply the attention mechanism we have seen earlier to each head separately. The NH heads never interact with each other during the attention computation (which makes parallel computation possible). Now, since each head computes a single attention value per pair of tokens, the shape of the multi-head attention matrix is (NH, T, T).
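The standalone sketch below (toy sizes, not the llm.c implementation, and with the causal mask left out for brevity) shows how each head only looks at its own slice of size hs of Q and K, producing an attention matrix of shape (NH, T, T).

```c
// Sketch of multi-head attention scores (toy sizes, causal mask omitted):
// Q and K of shape (T, C) are sliced into NH heads of size hs = C / NH and each
// head computes its own (T, T) attention matrix, independently of the others.
#include <math.h>
#include <stdio.h>

#define T  3
#define C  4
#define NH 2
#define hs (C / NH)   // head size

int main(void) {
    float Q[T][C] = {{1,0,0,1},{0,1,1,0},{1,1,0,0}};  // toy queries
    float K[T][C] = {{1,0,1,0},{0,1,0,1},{1,1,1,1}};  // toy keys
    float att[NH][T][T];                              // one (T, T) matrix per head

    for (int h = 0; h < NH; h++) {
        for (int t = 0; t < T; t++) {
            float row[T];
            float maxv = -1e30f;
            // raw scores of head h: dot product over this head's hs channels only
            for (int t2 = 0; t2 < T; t2++) {
                float s = 0.0f;
                for (int i = 0; i < hs; i++) { s += Q[t][h*hs + i] * K[t2][h*hs + i]; }
                row[t2] = s / sqrtf((float)hs);
                if (row[t2] > maxv) { maxv = row[t2]; }
            }
            // normalize the row so it sums to 1
            float sum = 0.0f;
            for (int t2 = 0; t2 < T; t2++) { row[t2] = expf(row[t2] - maxv); sum += row[t2]; }
            for (int t2 = 0; t2 < T; t2++) { att[h][t][t2] = row[t2] / sum; }
        }
    }

    // print the (T, T) attention matrix of head 0
    for (int t = 0; t < T; t++) {
        for (int t2 = 0; t2 < T; t2++) { printf("%.2f ", att[0][t][t2]); }
        printf("\n");
    }
    return 0;
}
```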
|
|
|
|
|
|
### From attention matrix to the output of the attention layer
|
|
|
|
|
|
The last step is to do the dot product between our multi-head attention matrix and the value vectors.
|
|
|
|
|
|
![Multi-head_attention_7\_](uploads/4db46c447ddf27feaf4f8dcf3a394122/Multi-head_attention_7\_.png)
|
|
|
|
|
|
To do this, we still take each head separately, and compute the dot product between one head of the multi-head attention matrix and the matching value sub-array of that head. Since we are multiplying a value of size 1 (the attention value of one head for one pair of tokens) with an array of size _hs_ (the value sub-array of one head), each product is an array of size _hs_. Doing this for the _NH_ heads, the global matrix at this point is of size (NH\*hs, T, T) = (C, T, T). We then sum each row over the key dimension, which gives an output matrix of shape (C, T). Also, remember that we are processing batches, so this is applied to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we drop the representation we used for convenience, the actual output shape is (B, T, C) (we have just swapped two axes).
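Here is a standalone sketch of this last step with toy sizes (again not the llm.c code): each head's (T, T) attention matrix weights that head's value sub-arrays of size hs, and the NH results are written into their own slice of the output, giving a (T, C) matrix for one sequence.

```c
// Sketch of the final step (toy sizes): each head's (T, T) attention matrix is
// multiplied with that head's value sub-arrays (size hs), and the NH head
// outputs are concatenated back into vectors of size C = NH * hs, giving (T, C).
#include <stdio.h>

#define T  2
#define C  4
#define NH 2
#define hs (C / NH)

int main(void) {
    // illustrative attention matrices, one per head, rows sum to 1
    float att[NH][T][T] = {
        {{1.0f, 0.0f}, {0.5f, 0.5f}},
        {{1.0f, 0.0f}, {0.2f, 0.8f}},
    };
    float V[T][C]   = {{1,2,3,4},{5,6,7,8}};  // toy values: head 0 = cols 0..1, head 1 = cols 2..3
    float out[T][C] = {{0}};

    for (int h = 0; h < NH; h++)
        for (int t = 0; t < T; t++)
            for (int i = 0; i < hs; i++) {
                float acc = 0.0f;
                // weighted sum of the value sub-arrays of all tokens t2, for head h
                for (int t2 = 0; t2 < T; t2++) { acc += att[h][t][t2] * V[t2][h*hs + i]; }
                out[t][h*hs + i] = acc;  // write into this head's slice of the output
            }

    for (int t = 0; t < T; t++) {
        printf("out[%d] = ", t);
        for (int i = 0; i < C; i++) { printf("%.2f ", out[t][i]); }
        printf("\n");
    }
    return 0;
}
```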
|
|
|
|
|
|
### Residual layer
|
|
|
|