
Task Based Model

Task-Based-Model.md · last updated Sep 20, 2024 by tchatela
Moreover, the aim of the backward layer is to compute the gradients, which will then be used to update the parameters.
The particularity of this given order is that we don't have to fully complete a step before moving on to the next one. In fact, the gradient values can be set to 0 just before their new values are computed in the corresponding backward layer. This way, we can, for example, overlap a backward layer with setting the gradients of the next backward layer to 0. The same idea works for the update of the forward layers' parameters, as we can update the weights of one layer during its previous forward layer. However, keep in mind that the backward pass is the forward pass mirrored, so the first backward layers imply data dependencies (the update of the weights) towards the last forward layer.
With everything that has been stated, we can now create the following data flow diagram:

![GPT-2_task_based_model](uploads/0b71cb63f70b983d90c46001bcc29702/GPT-2_task_based_model.png)
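As a rough illustration of this overlap, here is a minimal sketch with one task per layer and dependencies placed on per-layer sentinel elements. Every name in it (`params`, `grads`, `acts`, `backward_layer`, ...) is hypothetical, the real implementation works at the granularity of kernels and token blocks, and the OpenMP-style pragmas merely stand in for whichever tasking directives the code actually uses.

```c
enum { L = 12 };                            // number of layers (example value)
static float params[L], grads[L], acts[L];  // per-layer sentinel elements only

void zero_gradients(int l);   // stand-ins for the real kernels,
void backward_layer(int l);   // defined elsewhere
void update_layer(int l);
void forward_layer(int l);

void one_training_step(void) {
    // Backward pass: gradients are reset just before the backward kernel that
    // writes them, so zeroing layer l-1 can overlap with backward_layer(l).
    for (int l = L - 1; l >= 0; l--) {
        #pragma omp task depend(out: grads[l])
        zero_gradients(l);
        #pragma omp task depend(in: acts[l]) depend(inout: grads[l])
        backward_layer(l);
    }
    // Update + next forward pass: while forward_layer(l) is running on some
    // core, update_layer(l+1) can already execute. The update of the last
    // layers waits on gradients produced by the *first* backward tasks, which
    // is the mirrored dependency mentioned above.
    for (int l = 0; l < L; l++) {
        #pragma omp task depend(in: grads[l]) depend(inout: params[l])
        update_layer(l);
        #pragma omp task depend(in: params[l]) depend(out: acts[l])
        forward_layer(l);
    }
}
```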
# First implementation: nested tasks

To begin with, we put a taskwait between each iteration of the main training loop, while the rest of the program is task-based, as shown by the following sketch:

![Task_based_implementation](uploads/f9062c74a52ed79789ea0d4022624355/Task_based_implementation.png)

We define two types of dependencies. The first one is expressed on tokens, to allow the computation of the different layers to overlap. The second one allows the layers to overlap with the gradient reset and update phases.

For the dependencies around tokens, we create an array which we use to declare our dependencies. One element of this array represents the dependencies of one block of tokens.

For the dependencies around the update and reset phases, we use the first element of each slice of the parameters, gradients, activations and activation gradients arrays. In fact, each of these arrays can be decomposed into subarrays (one buffer per kernel), some of which can be sliced again according to whether they are used inside a transformer block or not. We use the first element of each innermost slice as the dependency.

Then, most of the task implementation is composed of tasks defined around groups of tokens, each containing a nested taskloop, as sketched below.
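A minimal sketch of this structure, with hypothetical names and example sizes: the `token_dep` array exists only to carry dependencies between consecutive layers, the outer task covers one block of tokens, and the nested taskloop splits the block further (the reset and update tasks would, in the same spirit, use the first element of the corresponding parameter or gradient slice as their sentinel).

```c
enum { BATCH = 4, BATCH_SUBSIZE = 1 };           // example sizes only
enum { N_BLOCKS = BATCH / BATCH_SUBSIZE };

static char token_dep[N_BLOCKS];   // one sentinel element per block of tokens

void layer_kernel_row(int b);      // stand-in for one kernel on one batch row

void layer_forward(void) {
    for (int blk = 0; blk < N_BLOCKS; blk++) {
        // Outer task: one block of tokens. Because every layer reuses the same
        // per-block sentinel, layer l+1 can start on block 0 while layer l is
        // still busy with block 1.
        #pragma omp task depend(inout: token_dep[blk])
        {
            // Inner taskloop: a second level of parallelism inside the block.
            #pragma omp taskloop
            for (int b = blk * BATCH_SUBSIZE; b < (blk + 1) * BATCH_SUBSIZE; b++) {
                layer_kernel_row(b);
            }
        }
    }
}
```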
This implementation gives us the following Paraver traces:
![task-based-forward-pass](uploads/68803df82325ddebb8fba7165cec9896/task-based-forward-pass.png)
![task-based-forward-pass.code_legend](uploads/405e6082260df84f652c7fc323b83312/task-based-forward-pass.code_legend.png)
![task-based-backward-pass](uploads/36c076443d041b488c57a9deb96a0005/task-based-backward-pass.png)
![task-based-backward-pass.code_legend](uploads/b60c10db490f7110f22830a448929411/task-based-backward-pass.code_legend.png)
This leads us to the following performance results:
Here we tried to slice according to the token sequence only (BATCH_SUBSIZE=4), or according to both the token sequence and the batch (BATCH_SUBSIZE=1). The slope obtained for the second version seems better, so we keep it for the following tests.
![Comparison_BATCH_SUBSIZE](uploads/2680eec7bdf9875f9a7b79f21ea3f565/Comparison_BATCH_SUBSIZE.png)
Because we have a taskwait between each iteration of the main training loop, it is possible to measure the runtime, speedup and efficiency of the forward and backward passes independently. The following diagrams suggest that the application is memory bound.
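For reference, the speedup and efficiency shown below are presumably the standard definitions, with T(p) the runtime of a pass on p cores:

```math
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
```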
![runtimeT](uploads/7e7c70a6bd6723d02927bebd614eb242/runtimeT.png)
![tbS](uploads/6f3833c1c74dbfa74eb3c6f78dd9cf2d/tbS.png)
![tbE](uploads/70660e5b4387ac75b7f78e79b8563ed2/tbE.png)
We can now compare the efficiency of this version of the task-based model with the fork-join model:
![cmptf](uploads/7a0dd4169473042b29639600f4c28095/cmptf.png)
# Final implementation
Now, we want to remove the taskloops used across the main training loop, to improve task management and to allow the use of taskiter.
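A minimal sketch of what this could look like, assuming the OmpSs-2 `taskiter` construct and the upstream llm.c training API (each call is assumed to spawn its own tasks internally; data loading is omitted and the hyperparameter values are taken from the upstream example):

```c
// Sketch only: the training loop becomes a single, possibly cyclic task graph,
// so tasks from consecutive iterations are ordered by their data dependencies
// instead of a taskwait after every step.
#pragma oss taskiter
for (int step = 0; step < num_steps; step++) {
    gpt2_forward(&model, inputs, targets, B, T);   // forward-pass tasks
    gpt2_zero_grad(&model);                        // gradient-reset tasks
    gpt2_backward(&model);                         // backward-pass tasks
    gpt2_update(&model, 1e-4f, 0.9f, 0.999f, 1e-8f, 0.0f, step + 1);  // update tasks
}
#pragma oss taskwait
```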
**_The implementation has changed; this diagram is no longer relevant._**
![Task_based_implementation_final](uploads/3551b24d36cbae27f861414cd2063322/Task_based_implementation_final.png)
![taskiter](uploads/212d9f02a378aa4e4eb6d8e978174029/taskiter.png)
![taskiter.code_legend](uploads/225a1f2cbda768db5d6b04e3f0322299/taskiter.code_legend.png)
![taskiter-speddup](uploads/4e4cd0d4369d283263d1dddbbb8cf22a/taskiter-speddup.png)