Comparison of the update phase, without considering task overlapping.

| Strategy 3 | Not tested | Not tested |
Note that strategy 3 is the one used in the GPU version. However, it is much simpler for us to use strategy 2 with the task-based implementation.

From now on, we will use strategy 2.
# Update phase

It is very important to understand how the memory layout is done. What we are communicating are gradients (reduction) and parameters (broadcast). These two arrays have exactly the same shape, and having a gradient value at position x allows us to update the parameter at position x.
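To make this one-to-one correspondence concrete, here is a minimal sketch of an element-wise update. Plain SGD is shown only for illustration; the real optimizer in llm.c is AdamW, which keeps extra per-parameter state but relies on the same indexing.

```c
#include <stddef.h>

// grads[x] is the gradient of params[x], so the update touches each position
// independently. Plain SGD for illustration; the actual update (AdamW) adds
// per-parameter moment buffers but uses the same one-to-one indexing.
void update_params(float *params, const float *grads, size_t num_parameters, float lr) {
    for (size_t x = 0; x < num_parameters; x++) {
        params[x] -= lr * grads[x];
    }
}
```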
Moreover, the gradients are computed during the backward pass, slice by slice, so we can start reducing them while the backward pass is still running. However, these slices are not produced in the same order as the layout of the gradients array. Therefore, you have to register, in a structure, where each slice is, what its size is, and where its OmpSs-2 dependency pointer is.
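Here is a minimal sketch of such a registration structure, with hypothetical names (the actual code may organize this differently): each entry records where the slice lives, how large it is, and which address serves as its OmpSs-2 dependency.

```c
#include <stddef.h>

// Hypothetical descriptor for one gradient slice produced by the backward pass.
typedef struct {
    size_t offset;    // start of the slice inside the gradients array
    size_t size;      // number of elements in the slice
    float *dep_ptr;   // address used as the OmpSs-2 dependency for this slice
} grad_slice_t;

// Register one slice so that the reduction tasks can later depend on dep_ptr.
static void register_slice(grad_slice_t *table, size_t index,
                           float *grads, size_t offset, size_t size) {
    table[index].offset  = offset;
    table[index].size    = size;
    table[index].dep_ptr = grads + offset;
}
```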
Therefore, the order in which we send the gradients is quite messy, and it would certainly be more efficient to change the gradients and parameters layout, even though this has not been done in the GPU version. Indeed, the memory layout is the following: (a)

    {
        wte,
        wpe,
        ln1w[L],
        ln1b[L],
        qkvw[L],
        qkvb[L],
        attprojw[L],
        attprojb[L],
        ln2w[L],
        ln2b[L],
        fcw[L],
        fcb[L],
        fcprojw[L],
        fcprojb[L],
        lnfw,
        lnfb
    }

With L the number of transformer blocks (12).

Then, the slices become ready to be sent in this order: (b)

    lnfb, lnfw, fcprojb[L-1], fcprojw[L-1], fcb[L-1], ..., ln1w[L-1], fcprojb[L-2], ..., wpe, wte
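To relate layout (a) and order (b), here is a small sketch with illustrative names (`param_sizes` is assumed to hold the element counts of the 16 tensor groups). Offsets in (a) are simple cumulative sums, and the slice of layer l inside a per-layer tensor group sits one layer-stride into that group, which is why consecutive slices of order (b) end up scattered in memory.

```c
#include <stddef.h>

#define NUM_PARAM_TENSORS 16

// The 16 tensor groups, in layout order (a).
static const char *param_names[NUM_PARAM_TENSORS] = {
    "wte", "wpe", "ln1w", "ln1b", "qkvw", "qkvb", "attprojw", "attprojb",
    "ln2w", "ln2b", "fcw", "fcb", "fcprojw", "fcprojb", "lnfw", "lnfb"
};

// Layout (a) is a flat concatenation: the offset of group i is the cumulative
// sum of the sizes of groups 0..i-1.
static void fill_param_offsets(const size_t param_sizes[NUM_PARAM_TENSORS],
                               size_t param_offsets[NUM_PARAM_TENSORS]) {
    size_t offset = 0;
    for (int i = 0; i < NUM_PARAM_TENSORS; i++) {
        param_offsets[i] = offset;
        offset += param_sizes[i];
    }
}

// For a per-layer group (ln1w .. fcprojb), the slice belonging to layer `layer`
// starts one layer-stride into the group. Order (b) walks the layers from L-1
// down to 0, so two consecutive slices of (b) usually belong to different
// groups and are far apart in memory.
static size_t layer_slice_offset(const size_t param_offsets[NUM_PARAM_TENSORS],
                                 const size_t param_sizes[NUM_PARAM_TENSORS],
                                 int group, int layer, int L) {
    size_t per_layer = param_sizes[group] / (size_t)L;
    return param_offsets[group] + (size_t)layer * per_layer;
}
```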
Therefore, you cannot simply reduce slices of equal size. However, it is very important that, at any given moment, we use as many different roots as possible for our reductions (to avoid network congestion). That is why I have defined a communication block size. This block size is not the amount of data sent by a single MPI communication, but by many of them. Basically, you just walk through the memory in the order of (b) (as if it were reshaped) and cut it into slices of size "communication block size", even if these communication slices are not contiguous in memory. These blocks are defined this way to avoid, as much as we can, network congestion when the data become ready. Therefore, when one communication block becomes ready, it launches one or many TAMPI_Ireduce operations towards one specific root.
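Here is a minimal sketch of this blocking scheme, under assumptions called out in the comments (the block size, the struct and the round-robin root assignment are illustrative, not the actual implementation):

```c
#include <stddef.h>

// Elements per communication block (illustrative value, tuned in practice).
#define COMM_BLOCK_SIZE (1 << 20)

// One communication block: a fixed-size chunk of the gradient stream taken in
// the order of (b), reduced towards its own root.
typedef struct {
    size_t start;   // position of the block in the reshaped (order-(b)) stream
    size_t size;    // number of elements in the block
    int    root;    // rank acting as reduction root for this block
} comm_block_t;

// Cut a stream of `total` elements into blocks and spread the roots
// round-robin over `nranks` processes, so that blocks that become ready close
// in time target different roots. Returns the number of blocks written.
static size_t build_comm_blocks(comm_block_t *blocks, size_t total, int nranks) {
    size_t nblocks = 0;
    for (size_t start = 0; start < total; start += COMM_BLOCK_SIZE) {
        size_t remaining = total - start;
        blocks[nblocks].start = start;
        blocks[nblocks].size  = remaining < COMM_BLOCK_SIZE ? remaining : COMM_BLOCK_SIZE;
        blocks[nblocks].root  = (int)(nblocks % (size_t)nranks);
        nblocks++;
    }
    return nblocks;
}

// When every slice covered by a block has been produced by the backward pass,
// the block becomes ready and the reductions towards blocks[i].root are
// launched (the TAMPI_Ireduce calls mentioned above). Since a block may span
// several non-contiguous memory regions, this can mean more than one call.
```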
# Benchmarking