|
|
|
|
|
During a forward pass, nearly all of the computations are channel-independent: the attention layer is the only one that uses multiple channels in a single calculation. We can therefore set up tasks that take k channels as their input data, with k a divisor of T in \[1, T\], which lets us overlap two different layers working on different sets of k channels. However, the attention layer uses every token of a sentence for its computation, so we must wait for all the k-channel sets of a sentence before entering this layer (this is also why we choose k as a divisor of T in \[1, T\]). But since the attention layer only processes one sentence at a time, we can still compute k-channel sets from other sentences while the attention layer is running.
|
|
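A minimal sketch of this scheduling idea, written here with OpenMP tasks purely for illustration: the runtime, the function names (`channel_independent_layer`, `attention_layer`) and the values of `B`, `T` and `K` below are assumptions, not the project's actual code. Each k-channel set is one task of a channel-independent layer; attention for a sentence only starts once every set of that sentence is done; sets belonging to other sentences remain free to run while attention executes.

```c
#include <stdio.h>
#include <omp.h>

#define B 4   /* sentences in the batch (placeholder value) */
#define T 8   /* tokens per sentence (placeholder value)    */
#define K 2   /* set size: a divisor of T in [1, T]         */

/* stand-in for any channel-independent layer applied to one k-sized set */
static void channel_independent_layer(int s, int set) {
    printf("layer     sentence=%d set=%d thread=%d\n", s, set, omp_get_thread_num());
}

/* stand-in for the attention layer, which needs the whole sentence */
static void attention_layer(int s) {
    printf("attention sentence=%d       thread=%d\n", s, omp_get_thread_num());
}

int main(void) {
    #pragma omp parallel
    #pragma omp single
    for (int s = 0; s < B; s++) {
        #pragma omp task firstprivate(s)   /* one task per sentence */
        {
            /* channel-independent work: one nested task per k-sized set */
            #pragma omp taskgroup
            {
                for (int c = 0; c < T / K; c++) {
                    #pragma omp task firstprivate(s, c)
                    channel_independent_layer(s, c);
                }
            }
            /* the taskgroup waits for every set of sentence s, while tasks
             * of other sentences may keep running as attention executes */
            attention_layer(s);
        }
    }
    return 0;
}
```

Compiled with `gcc -fopenmp`, the output order depends on the scheduler, but sets of later sentences can interleave with an earlier sentence's attention line, which is exactly the overlap described above.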
|
|
|
|
|
|
During a backward pass, the principle of k-channels is a bit different: we are no longer computing tokens but weights, and the weights and biases do not have a shape that depends on T. Basically, we can see this as k = 1. However, this is not a problem.
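To make the shape argument concrete, here is a tiny, purely illustrative piece of C (not the project's code; `T`, `C_IN`, `C_OUT` and the buffers are placeholders): for a linear layer, the weight gradient accumulates a contribution from every token, so its shape does not depend on T, and any split over token sets would write into the same buffer, which matches the k = 1 view above.

```c
#include <stdio.h>

#define T     8   /* tokens per sentence (placeholder) */
#define C_IN  3   /* input channels (placeholder)      */
#define C_OUT 2   /* output channels (placeholder)     */

int main(void) {
    float x[T][C_IN];               /* layer input                  */
    float dy[T][C_OUT];             /* gradient flowing back into y */
    float dW[C_IN][C_OUT] = {{0}};  /* gradient of the weights      */

    /* fill the inputs with arbitrary values just to have numbers to sum */
    for (int t = 0; t < T; t++) {
        for (int i = 0; i < C_IN; i++)  x[t][i]  = 0.1f * (t + i);
        for (int j = 0; j < C_OUT; j++) dy[t][j] = 0.01f * (t - j);
    }

    /* every token t contributes to the same dW entries:
     * dW has shape C_IN x C_OUT, independent of T */
    for (int t = 0; t < T; t++)
        for (int i = 0; i < C_IN; i++)
            for (int j = 0; j < C_OUT; j++)
                dW[i][j] += x[t][i] * dy[t][j];

    printf("dW[0][0] = %f\n", dW[0][0]);
    return 0;
}
```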
|