... | ... | @@ -37,4 +37,10 @@ With everything that has been stated, we can now create the following data flow |
|
|
We define two kinds of dependencies. The first kind is placed on tokens, so that the computation of different layers can overlap. The second kind lets the layer computation overlap with the gradient reset and update phases.
|
|
|
For the token dependencies, we create an array whose elements are used to declare task dependencies: each element represents one block of tokens.
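
As an illustration of the token dependency array, here is a minimal sketch. It assumes standard OpenMP `depend` clauses as a stand-in for the project's actual task pragmas, which are not shown in this section. `NUM_BLOCKS`, `BLOCK_SIZE`, `token_dep`, `layer_a`, `layer_b` and `run_two_layers` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BLOCKS 4   /* hypothetical: number of token blocks */
#define BLOCK_SIZE 8   /* hypothetical: tokens per block */

/* Hypothetical stand-ins for two consecutive layers. */
static void layer_a(float *x, int first, int n) {
    for (int t = first; t < first + n; t++) x[t] += 1.0f;
}

static void layer_b(float *x, int first, int n) {
    for (int t = first; t < first + n; t++) x[t] *= 2.0f;
}

/* Runs layer_a then layer_b on every token block.
   token_dep[b] is never touched by the kernels: it only names
   block b in the depend clauses. */
void run_two_layers(float *x) {
    assert(x != NULL);
    static char token_dep[NUM_BLOCKS];

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < NUM_BLOCKS; b++) {
            #pragma omp task depend(inout: token_dep[b]) firstprivate(b)
            layer_a(x, b * BLOCK_SIZE, BLOCK_SIZE);
        }
        for (int b = 0; b < NUM_BLOCKS; b++) {
            /* Waits only for layer_a on the same block b. */
            #pragma omp task depend(inout: token_dep[b]) firstprivate(b)
            layer_b(x, b * BLOCK_SIZE, BLOCK_SIZE);
        }
    }
}
```

In this sketch, `layer_b` on block 0 may start while `layer_a` is still running on block 3, which is the overlap between layers that the token dependencies are meant to allow.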
|
|
|
For the dependencies on the update and reset phases, we use the first element of each slice of the parameter, gradient, activation, and activation-gradient arrays. Each of these arrays can be decomposed into subarrays (one buffer per kernel), and some of these subarrays can be sliced again depending on whether they are used inside a transformer block. The dependency is placed on the first element of each innermost slice.
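
The idea of using the first element of a slice as its dependency can be sketched as follows. This is an assumption-laden example, not the project's code: it uses standard OpenMP `depend` clauses, a single per-kernel buffer sliced by layer, and the hypothetical names `NUM_LAYERS`, `SLICE_LEN` and `train_step`.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LAYERS 3   /* hypothetical: number of transformer blocks */
#define SLICE_LEN  4   /* hypothetical: values per layer slice */

/* One step for a single per-kernel buffer sliced by layer:
   reset the gradients, accumulate them, then update the parameters.
   The first element of each slice stands for the whole slice. */
void train_step(float *params, float *grads, float lr) {
    assert(params != NULL && grads != NULL);

    #pragma omp parallel
    #pragma omp single
    {
        for (int l = 0; l < NUM_LAYERS; l++) {
            float *g = grads + (size_t)l * SLICE_LEN;
            float *p = params + (size_t)l * SLICE_LEN;

            /* Reset phase for this slice. */
            #pragma omp task depend(out: g[0]) firstprivate(g)
            for (int i = 0; i < SLICE_LEN; i++) g[i] = 0.0f;

            /* Stand-in for the backward pass of layer l. */
            #pragma omp task depend(inout: g[0]) firstprivate(g, l)
            for (int i = 0; i < SLICE_LEN; i++) g[i] += (float)(l + 1);

            /* Update phase for this slice. */
            #pragma omp task depend(in: g[0]) depend(inout: p[0]) firstprivate(g, p, lr)
            for (int i = 0; i < SLICE_LEN; i++) p[i] -= lr * g[i];
        }
    }
}
```

Because each slice has its own dependency element, the update of one layer does not have to wait for the backward pass of another layer.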
|
|
|
|
|
As a result, most of the implementation consists of tasks defined over groups of tokens, and each of these tasks contains a nested taskloop.
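
A task over a group of tokens with a nested taskloop could look like the sketch below. It is written with standard OpenMP pragmas as an assumption, and `NUM_BLOCKS`, `BLOCK_SIZE`, `CHANNELS`, `token_dep` and `scale_tokens` are hypothetical names standing in for a real kernel.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BLOCKS 4    /* hypothetical: number of token blocks */
#define BLOCK_SIZE 8    /* hypothetical: tokens per block */
#define CHANNELS   16   /* hypothetical: values per token */

/* Hypothetical kernel: out = factor * in, one outer task per token block. */
void scale_tokens(float *out, const float *in, float factor) {
    assert(out != NULL && in != NULL);
    static char token_dep[NUM_BLOCKS];

    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < NUM_BLOCKS; b++) {
            /* Outer task: carries the dependency of token block b. */
            #pragma omp task depend(inout: token_dep[b]) firstprivate(b)
            {
                int first = b * BLOCK_SIZE;
                /* Nested taskloop: splits the tokens of the block
                   into child tasks of the outer task. */
                #pragma omp taskloop
                for (int t = first; t < first + BLOCK_SIZE; t++) {
                    for (int c = 0; c < CHANNELS; c++)
                        out[t * CHANNELS + c] = factor * in[t * CHANNELS + c];
                }
            }
        }
    }
}
```

The outer task expresses the dependency between layers, while the inner taskloop provides the parallelism inside one block. The tasks created by the taskloop are nested inside the outer task.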
|
|
|
|
|
|
# Remove nested tasks
|
|
|
|
|
|
We would like to use taskiter to remove the overhead of task creation. However, taskiter does not apply to nested tasks, so we want to remove as many of them as possible. The current structure is:
|
|
|
`for (int token=0; token < T; token++) {
|
|
|
` |
|
|
\ No newline at end of file |