The Query, Key and Value vectors are obtained by the operation: _token-array . qkvw-parameters-array + qkvb-parameters-array_, i.e. a matrix multiplication of each token's array with the QKV weight parameters, plus the QKV bias parameters.

Be aware that these vectors have already been computed when entering _attention_forward_: in reality, the two linear layers shown in the GPT-2 model are computed outside the _attention_forward_ function, using the _matmul_forward_ function.
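
Since this projection happens outside _attention_forward_, here is a minimal sketch of what it computes, using the shape conventions of this page: activations of shape (B, T, C), packed QKV weights of shape (3\*C, C) and packed biases of size 3\*C. The function name and the naive triple loop are only illustrative assumptions, not the actual _matmul_forward_ implementation.

```c
// Naive sketch of the QKV projection computed before attention_forward is called.
// inp:  (B, T, C)    activations, one array of size C per token
// qkvw: (3*C, C)     packed Query/Key/Value weight parameters
// qkvb: (3*C)        packed Query/Key/Value bias parameters
// out:  (B, T, 3*C)  concatenated Q, K and V vectors for every token
void qkv_projection(float* out, const float* inp,
                    const float* qkvw, const float* qkvb,
                    int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;   // this token's array of size C
            float* y = out + (b * T + t) * 3 * C;     // its Q|K|V output, 3*C values
            for (int o = 0; o < 3 * C; o++) {
                float val = qkvb[o];                   // start from the bias
                for (int i = 0; i < C; i++) {
                    val += x[i] * qkvw[o * C + i];     // token-array . qkvw
                }
                y[o] = val;
            }
        }
    }
}
```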

![Multi-headAttention4](uploads/261d802297d30ba7534215e617e8a589/Multi-headAttention4.png)

![Multi-headAttention1](uploads/aa0bc50dce3cb7c1c2640adf15152b0c/Multi-headAttention1.png)

### Masked attention

In masked (causal) attention, each token is only allowed to attend to itself and to the tokens before it, so the attention values for future tokens are set to 0. We also re-normalize each row so that every row sums to 1.
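
As a concrete illustration, here is a minimal sketch of this masking and re-normalization for a single head of a single sequence, working on a (T, T) matrix of raw query-key dot products. The function name and the layout are illustrative assumptions, and the usual scaling of the raw scores is left out to keep the focus on the mask.

```c
#include <math.h>

// Sketch of masked (causal) attention weights for ONE head of ONE sequence.
// scores: (T, T) raw dot products between queries and keys, overwritten in place
//         with the normalized attention weights.
void masked_softmax(float* scores, int T) {
    for (int t = 0; t < T; t++) {
        float* row = scores + t * T;
        // token t may only attend to tokens t2 <= t
        // maximum over the allowed positions, for numerical stability
        float maxval = row[0];
        for (int t2 = 1; t2 <= t; t2++) {
            if (row[t2] > maxval) maxval = row[t2];
        }
        // exponentiate the allowed positions and accumulate the row sum
        float sum = 0.0f;
        for (int t2 = 0; t2 <= t; t2++) {
            row[t2] = expf(row[t2] - maxval);
            sum += row[t2];
        }
        // re-normalize so that the row sums to 1
        for (int t2 = 0; t2 <= t; t2++) {
            row[t2] /= sum;
        }
        // masked positions: attention towards future tokens is set to 0
        for (int t2 = t + 1; t2 < T; t2++) {
            row[t2] = 0.0f;
        }
    }
}
```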

Remember that each token is represented by an array of size C. Therefore, for each pair of words, we don't get a single attention value, but a vector of attention values.

![Multi-headAttention3](uploads/9a3999a8834f4e0f68aee23f1f648c88/Multi-headAttention3.png)

The question is: what is the shape of this 3D matrix? To answer this, we need to go into a bit more detail about multi-head attention.

![Multi-headAttention5](uploads/b436daa2df0ab3fd691ce0f28031a96b/Multi-headAttention5.png)

First, we need to introduce some vocabulary:

* _B_ is the batch size, the number of token sequences processed together.
* _T_ is the sequence length, the number of tokens in one sequence.
* _C_ is the number of channels, the size of the array representing one token.
* _NH_ is the number of attention heads.
* _hs_ is the head size, with _hs_ = _C_ / _NH_.

The principle of multi-head attention is to divide each Q, K and V vector into _NH_ heads, each of size _hs_.
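
To make this splitting concrete, here is a minimal sketch that computes the raw attention scores of every head for a single sequence, assuming the Q, K and V vectors of each token are stored contiguously in a (T, 3\*C) buffer as in the projection sketch above. The function name and the layout are illustrative assumptions. The result is one (T, T) score matrix per head, i.e. the (NH, T, T) block of the 3D matrix discussed above.

```c
#include <math.h>

// Raw (pre-softmax) attention scores of every head, for ONE sequence.
// qkv:    (T, 3*C)   concatenated Q|K|V vectors per token, with C = NH * hs
// scores: (NH, T, T) one (T, T) score matrix per head
void head_scores(float* scores, const float* qkv, int T, int NH, int hs) {
    int C = NH * hs;
    float scale = 1.0f / sqrtf((float)hs);  // usual scaling by the square root of the head size
    for (int h = 0; h < NH; h++) {
        for (int t = 0; t < T; t++) {
            const float* q = qkv + t * 3 * C + h * hs;           // query head h of token t
            for (int t2 = 0; t2 < T; t2++) {
                const float* k = qkv + t2 * 3 * C + C + h * hs;  // key head h of token t2
                float dot = 0.0f;
                for (int i = 0; i < hs; i++) {
                    dot += q[i] * k[i];
                }
                scores[(h * T + t) * T + t2] = dot * scale;
            }
        }
    }
}
```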

The last step is to take the dot product between our multi-head attention matrix and the value vectors.

![Multi-headAttention7](uploads/6caf9040941d8c182941d70d7fe215c5/Multi-headAttention7.png)

To do this, we again need to take each head separately. We take the product between one head of the multi-head attention matrix and the matching value sub-array of that head. Since we are multiplying an array of size _1_ (the attention value for one head) by an array of size _hs_ (the value sub-array for that head), the output array is of size _hs_. As we do this for the _NH_ heads, the global matrix at the end is of size (NH\*hs, T, T) = (C, T, T). We then need to sum the values along each row, which gives an output matrix of shape (C, T). Remember that we are processing batches, so this process applies to every token sequence of every batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we drop the representation we used for convenience, the actual output shape is (B, T, C) (we have simply swapped two axes).
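
Here is a minimal sketch of this weighted sum for a single head of a single sequence, assuming a contiguous (T, T) matrix of normalized attention weights and a (T, hs) matrix of value sub-arrays; the function name and the layouts are illustrative assumptions. Concatenating the _NH_ head outputs side by side gives the (T, C) result for one sequence, hence (B, T, C) over the whole batch.

```c
// Weighted sum of the value vectors for ONE head of ONE sequence.
// att:   (T, T)  normalized attention weights of this head (each row sums to 1)
// value: (T, hs) value sub-arrays of this head
// out:   (T, hs) this head's slice of the (T, C) output
void attention_values(float* out, const float* att, const float* value,
                      int T, int hs) {
    for (int t = 0; t < T; t++) {
        float* out_t = out + t * hs;
        for (int i = 0; i < hs; i++) {
            out_t[i] = 0.0f;
        }
        // only the current and previous tokens contribute (future weights are 0)
        for (int t2 = 0; t2 <= t; t2++) {
            float a = att[t * T + t2];          // scalar attention weight for the pair (t, t2)
            const float* v = value + t2 * hs;   // value sub-array of token t2
            for (int i = 0; i < hs; i++) {
                out_t[i] += a * v[i];
            }
        }
    }
}
```
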