... | @@ -53,3 +53,4 @@ In the end, I managed to get this slicing to give correct results, but I had sti |
|
- For the dataloader, my strategy is to load the tokens on one rank and scatter them to the others. In the GPU version, each rank loads its own tokens. I chose the scattering method because, for now, it makes it simpler to change the scattering shapes.
|
|
- When B*T is large, you must use the tinystories dataset.
|
|
- I have not implemented mini-batches yet, and this may be one of the first things to do next. However, it should be very simple to do.
|
|
- I have benchmarked the application here: https://docs.google.com/spreadsheets/d/1uS5uPAVtFLoj4BvirT4mke5_pVDsOm85ciNP-0iIO5M/edit?usp=sharing. I will continue benchmarking it over the coming weeks for my report.
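
The load-on-one-rank-then-scatter strategy from the dataloader bullet can be sketched roughly as follows. This is only an illustration in plain Python, not the project's actual code: the function name `scatter_tokens`, the contiguous per-rank layout, and the one-token shift used to build targets are all assumptions. In a real multi-rank run, the resulting shards would be handed to a scatter collective (for example `MPI_Scatter`, if the project uses MPI) instead of being returned as a list.

```python
def scatter_tokens(tokens, num_ranks, B, T):
    """Split one step's worth of tokens into per-rank (inputs, targets) shards.

    Hypothetical helper: the root rank is assumed to hold `tokens`, and each
    rank is assumed to receive a contiguous slice of B*T tokens.
    """
    per_rank = B * T
    # One extra token is needed so the last input position has a target.
    needed = num_ranks * per_rank + 1
    if len(tokens) < needed:
        raise ValueError(
            f"need at least {needed} tokens for {num_ranks} ranks with B*T={per_rank}"
        )

    shards = []
    for rank in range(num_ranks):
        start = rank * per_rank
        inputs = tokens[start:start + per_rank]
        # Targets are the inputs shifted by one token (next-token prediction).
        targets = tokens[start + 1:start + per_rank + 1]
        shards.append((inputs, targets))
    return shards


# Example: 2 ranks, B=2, T=4, so each rank receives 8 inputs and 8 targets.
shards = scatter_tokens(list(range(17)), num_ranks=2, B=2, T=4)
```

Because the slicing lives in one place on the root, changing the shape each rank receives only means changing how `start` and `per_rank` are computed, which matches the reason given above for preferring the scatter approach for now.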