Task Based Model
Basically, we would like to avoid the black regions (sequential portions) that w[…]
First, we need to analyze what parts of the model can be overlapped. Therefore, we will inspect each layer and check for its data dependencies. Then, depending on these dependencies, we will create a data flow diagram that shows the different tasks and how each one depends on a previous one.
From a general point of view, each layer of the forward pass needs to have its weights and biases updated before it can be computed.
As the backward pass generates the gradient values (necessary to update the weights) in the reverse order of the forward pass, we have to wait until the end of the backward pass before beginning the next forward pass. In fact, the last layer computed in the backward pass is the encoder, which is the first layer of the forward pass.
The same logic applies for the backward pass, and we also need to wait for the forward […]
With this too general view of the model, it does not seem like we will be able to create overlapping tasks. Therefore, we should go more in depth to understand the model.
During a forward pass, nearly all of the computations are token-independent: only the attention layer uses multiple tokens for one calculation. Therefore, we can set up tasks that take k tokens as input data, with k a divisor of T in \[1, T\], allowing us to overlap two different layers working on different sets of k tokens. However, the attention layer uses all the tokens of a sentence for its computation. Thus, it is mandatory to wait for all the tokens of a sentence before entering this layer (this is also why we choose k as a divisor of T in \[1, T\]). But as the attention layer only computes one sentence at a time, we can still compute k-token sets from other sentences while the attention layer is running.
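To make this concrete, here is a minimal sketch of the scheduling idea with OpenMP tasks, assuming hypothetical chunk-level kernels `layernorm_forward_tokens` and `attention_forward_sentence` (wrappers operating on k tokens and on one whole sentence, which do not exist in llm.c as such): the k-token tasks of one sentence must finish before that sentence's attention, while chunks of other sentences keep running.

```c
// Minimal sketch, not llm.c code: hypothetical chunk-level wrappers around
// the existing kernels.
void layernorm_forward_tokens(float *out, const float *inp,
                              const float *weight, const float *bias,
                              int k, int C);
void attention_forward_sentence(float *out, const float *inp,
                                int T, int C, int NH);

// B sentences of T tokens with C channels; k must divide T.
void forward_sketch(float *out, float *mid, const float *inp,
                    const float *weight, const float *bias,
                    int B, int T, int C, int NH, int k) {
    #pragma omp parallel
    #pragma omp single
    for (int s = 0; s < B; s++) {
        #pragma omp task firstprivate(s)   // one task per sentence
        {
            // token-independent layer: one task per set of k tokens
            for (int t0 = 0; t0 < T; t0 += k) {
                #pragma omp task firstprivate(s, t0)
                layernorm_forward_tokens(mid + (s * T + t0) * C,
                                         inp + (s * T + t0) * C,
                                         weight, bias, k, C);
            }
            // attention uses every token of sentence s: wait for the k-token
            // tasks of THIS sentence only; tasks belonging to other sentences
            // keep running in parallel with this attention call.
            #pragma omp taskwait
            attention_forward_sentence(out + s * T * C, mid + s * T * C,
                                       T, C, NH);
        }
    }
}
```

This only illustrates the scheduling: the real kernels in train_gpt2.c operate on the whole (B, T) batch, so splitting them like this would require refactoring their loops.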
During a backward pass, the principle of k tokens is a bit different, as the weights and biases do not have a shape that depends on T. Basically, we can see this as k=1. However, this is not a problem, as we still need to compute the data arrays for the next backward layer. Those arrays do have a shape that depends on T, thus making it possible to use the notion of k tokens, which will be related to the output data (called _dinp_) and not to the weights or biases. We will then be able to overlap backward layers just like the forward layers, with the same exception for the attention layer.
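Under the same assumptions as the forward sketch above (hypothetical chunk-level kernels, tasks generated from inside an active parallel/single region), the split for a matmul backward could look like this: _dinp_ has a T dimension and can be produced k tokens at a time, while the weight and bias gradients are reductions over all tokens and stay in a single task here.

```c
// Sketch only: hypothetical chunk-level kernels, llm.c-like shapes
// (inp: B*T*C, dout: B*T*OC, dinp: B*T*C, dweight: OC*C, dbias: OC).
void matmul_backward_dinp_tokens(float *dinp, const float *dout,
                                 const float *weight, int k, int C, int OC);
void matmul_backward_dweight(float *dweight, float *dbias,
                             const float *dout, const float *inp,
                             int B, int T, int C, int OC);

// Assumed to be called from inside a parallel/single region, as above.
void matmul_backward_sketch(float *dinp, float *dweight, float *dbias,
                            const float *dout, const float *inp,
                            const float *weight,
                            int B, int T, int C, int OC, int k) {
    // dinp has a T dimension: produce it k tokens at a time, so that the
    // next backward layer can start on finished chunks early.
    for (int s = 0; s < B; s++) {
        for (int t0 = 0; t0 < T; t0 += k) {
            #pragma omp task firstprivate(s, t0)
            matmul_backward_dinp_tokens(dinp + (s * T + t0) * C,
                                        dout + (s * T + t0) * OC,
                                        weight, k, C, OC);
        }
    }
    // dweight and dbias have no T dimension: they are reductions over all
    // tokens (the "k = 1" case), kept as one task per layer to avoid
    // concurrent accumulation into the same buffers.
    #pragma omp task
    matmul_backward_dweight(dweight, dbias, dout, inp, B, T, C, OC);
}
```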
Moreover, the aim of the backward pass is to compute the gradients, which are then used to update the weights and biases of each layer's parameters. In the code, it is done this way (a sketch of the corresponding loop is given after the list):
1. Forward pass
2. Set all gradient values to 0
3. Backward pass to compute the gradients
4. Update each forward layer's parameters using the gradients
5. New forward pass
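For reference, this order corresponds to the main training loop in train_gpt2.c, which looks roughly like the following (data loading omitted, variable names such as `inputs`, `targets`, `learning_rate` abbreviated; exact signatures may differ between versions of llm.c):

```c
for (int step = 0; step < num_steps; step++) {
    // 1. / 5. forward pass on the current batch
    gpt2_forward(&model, inputs, targets, B, T);
    // 2. reset every gradient buffer to 0
    gpt2_zero_grad(&model);
    // 3. backward pass fills the gradients
    gpt2_backward(&model);
    // 4. update all parameters using the gradients (AdamW)
    gpt2_update(&model, learning_rate, beta1, beta2, eps, weight_decay, step + 1);
}
```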
The particularity of this given order is that we don't have to fully complete a step before moving to the next one. In fact, the gradient values of a layer can be set to 0 just before their new values are computed in the corresponding backward layer. This way, we can, for example, zero the gradients of the next backward layer while the current backward layer is being computed. The same idea works for the update of the forward layers' parameters, as we can update the weights of one layer while its previous forward layer is being computed. However, keep in mind that the backward pass is the forward pass mirrored, so the first backward layers imply data dependencies (the update of the weights) towards the last forward layers.
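As an illustration only, these relaxed orderings can be written as task dependencies instead of global barriers. The sketch below uses hypothetical per-layer helpers (`zero_grad_layer`, `backward_layer`, `update_layer`, `forward_layer`) and per-layer handle arrays that exist only for dependency tracking; it encodes just the constraints described in this section (zeroing right before the backward of the same layer, parameter updates gating only the matching layer of the next forward pass) and leaves out the forward activations reused by the next backward pass.

```c
#define NLAYERS 12  /* illustrative; GPT-2 small has 12 transformer blocks */

typedef struct GPT2 GPT2;  /* stands in for llm.c's model struct */

/* Hypothetical per-layer helpers: llm.c currently runs whole passes,
 * not individual layers, so these are assumptions of the sketch. */
void zero_grad_layer(GPT2 *m, int l);
void backward_layer(GPT2 *m, int l);
void update_layer(GPT2 *m, int l);
void forward_layer(GPT2 *m, int l);

/* params/grads/acts/dacts are used purely as OpenMP dependency handles:
 * acts[0] stands for the embedded inputs, dacts[NLAYERS] for the gradient
 * coming from the loss. */
void train_step_tasks(GPT2 *model,
                      char params[NLAYERS], char grads[NLAYERS],
                      char acts[NLAYERS + 1], char dacts[NLAYERS + 1]) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int l = NLAYERS - 1; l >= 0; l--) {
            // zero the gradients of layer l right before its backward task
            #pragma omp task depend(out: grads[l]) firstprivate(l)
            zero_grad_layer(model, l);

            // backward of layer l: needs its zeroed gradients and the
            // activation gradients produced by the backward of layer l+1
            #pragma omp task depend(inout: grads[l]) depend(in: dacts[l + 1]) \
                             depend(out: dacts[l]) firstprivate(l)
            backward_layer(model, l);

            // update of layer l: needs its gradients, and only gates the
            // forward task of the SAME layer in the next pass
            #pragma omp task depend(in: grads[l]) depend(out: params[l]) firstprivate(l)
            update_layer(model, l);
        }
        // next forward pass: layer l waits for its own updated parameters and
        // the activations of layer l-1, not for a barrier over the whole update
        for (int l = 0; l < NLAYERS; l++) {
            #pragma omp task depend(in: params[l]) depend(in: acts[l]) \
                             depend(out: acts[l + 1]) firstprivate(l)
            forward_layer(model, l);
        }
    }
}
```

A real implementation would also have to express the dependencies on the forward activations needed by the backward pass, and combine this per-layer view with the k-token splitting described above.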
With everything that has been stated, we can now create the following data flow diagram:

![GPT-2_task_based_model](uploads/7bcafdb19c366e6dc142d182ef75b931/GPT-2_task_based_model.png)