The principle of multi-head attention is to divide each Q,K,V into _head_s.
|
|
|
|
|
The last step is to do the dot product between our multi-head attention matrix and the value vectors.
|
|
|
|
|
|
![Multi-headAttention7](uploads/6caf9040941d8c182941d70d7fe215c5/Multi-headAttention7.png)
|
|
|
![Multi-headAttention11](uploads/cab96fcaece49c827f700814ce7300af/Multi-headAttention11.png)
|
|
|
|
|
|
To do this, we still need to take each head separately. We then compute the dot product between one head of the multi-head attention matrix and the matching value head of each attention head vector. Since we are taking the product of an array of size _1_ (the attention value for one head) and an array of size _hs_ (the value sub-array for one head), the output array is of size _hs_. Doing this for all _NH_ heads gives a global matrix of size (NH\*hs, T, T) = (C, T, T). We then sum each row to itself, producing an output matrix of shape (C, T). Remember that we are processing batches, so this applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we drop the representation we used for convenience, the actual output shape is (B, T, C) (we have just swapped two axes).
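
To make this weighted sum concrete, here is a minimal NumPy sketch (not the code of this project; the names `att`, `v` and the sizes `B`, `NH`, `T`, `hs` are illustrative assumptions). Each attention scalar of a head scales that head's value sub-array of size _hs_, the products are summed over the sequence dimension, and the result is written into the (B, T, C) output described above.

```python
import numpy as np

# Illustrative sizes: batch, number of heads, sequence length, head size.
B, NH, T, hs = 2, 4, 8, 16
C = NH * hs

rng = np.random.default_rng(0)
# att: (B, NH, T, T) attention weights per head, rows normalised like softmax output.
att = rng.random((B, NH, T, T))
att /= att.sum(axis=-1, keepdims=True)
# v: (B, NH, T, hs) value vectors split per head, with C = NH * hs.
v = rng.random((B, NH, T, hs))

out = np.zeros((B, T, C))
for b in range(B):
    for h in range(NH):
        for t in range(T):
            # Weighted sum over the T value vectors of this head:
            # each attention scalar att[b, h, t, t2] scales a value
            # sub-array of size hs, and the scaled arrays are summed.
            acc = np.zeros(hs)
            for t2 in range(T):
                acc += att[b, h, t, t2] * v[b, h, t2]
            # The hs-sized result of head h fills its slice of the C dimension.
            out[b, t, h * hs:(h + 1) * hs] = acc

# Same result computed for all heads at once with a single einsum.
out_fast = np.einsum('bhtu,bhuc->bthc', att, v).reshape(B, T, C)
assert np.allclose(out, out_fast)
```

The explicit loops mirror the head-by-head description above, while the `einsum` line shows that the whole step is just a batched matrix product followed by a reshape back to (B, T, C).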
|
|
|
|