# GPT2 Parallelization and porting

Updated Jun 25, 2024 by tchatela
In order to calculate our gradients for our backpropagation, we have to keep an activation history (a sketch of a matching C struct is given after the list):
- **inputs** : Contains all the tokens for the current batch (B, T)
- **encoded** : Output of the positional_encoding layer (B, T, C)
- **ln1** : Output of the first layernorm inside the transformer block (L, B, T, C)
- **ln1_mean** : The mean of each channel for each token in the first layernorm inside the transformer block (L, B, T)
- **ln1_rstd** : The reciprocal standard deviation of each channel for each token in the first layernorm inside the transformer block (L, B, T)
- **qkv** : Output of the first linear layer in the attention layer (L, B, T, 3C)
- **atty** : Output of the attention function (L, B, T, C)
- **preatt** : _Query.Key_ for each head in the attention function (L, B, NH, T, T)
- **att** : Normalized _Query.Key_ for each head in the attention function (L, B, NH, T, T)
- **attproj** : Output of the second linear layer in the attention layer (L, B, T, C)
- **residual2** : Output of the first residual layer in the transformer block (L, B, T, C)
- **ln2** : Output of the second normalization layer in the transformer block (L, B, T, C)
- **ln2_mean** : The mean of each channel for each token in the second normalization layer in the transformer block (L, B, T)
- **ln2_rstd** : The reciprocal standard deviation of each channel for each token in the second normalization layer in the transformer block (L, B, T)
- **fch** : Output of the first linear layer in the transformer block (L, B, T, 4C)
- **fch_gelu** : Output of the GELU layer in the transformer block (L, B, T, 4C)
- **fcproj** : Output of the last linear layer in the transformer block (L, B, T, C)
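As an illustration, this list maps naturally onto a struct of flat float buffers. The sketch below is an assumption about naming and layout, not necessarily the exact struct used in the code:

```c
// Sketch: one flat buffer per stored activation. B = batch size,
// T = sequence length, C = channels, L = layers, NH = attention heads.
// The token inputs (B, T) are integers and are kept separately.
typedef struct {
    float* encoded;    // (B, T, C)
    float* ln1;        // (L, B, T, C)
    float* ln1_mean;   // (L, B, T)
    float* ln1_rstd;   // (L, B, T)
    float* qkv;        // (L, B, T, 3*C)
    float* atty;       // (L, B, T, C)
    float* preatt;     // (L, B, NH, T, T)
    float* att;        // (L, B, NH, T, T)
    float* attproj;    // (L, B, T, C)
    float* residual2;  // (L, B, T, C)
    float* ln2;        // (L, B, T, C)
    float* ln2_mean;   // (L, B, T)
    float* ln2_rstd;   // (L, B, T)
    float* fch;        // (L, B, T, 4*C)
    float* fch_gelu;   // (L, B, T, 4*C)
    float* fcproj;     // (L, B, T, C)
} Activations;
```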
...

This operation is always used to compute linear layers. The inputs are:

- A matrix of inputs of shape (B, T, k\*C)
- A matrix of weights of shape (B, k\*C, OC)
- A matrix of bias of shape (B, OC)
The output will be a matrix of size (B, T, OC). _OC_ stands for _Output Channels_.

Basically, if the inputs are called A, W and B, the output will be the result of _A.W + B_.

In the code, a tiled matmul has been implemented, using FMAs and OpenMP to increase performance. A sketch of the idea is shown below.
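A minimal sketch of such a kernel, assuming a (k\*C, OC) weight layout and a plain (OC,) bias; the tile size, names and signature are illustrative, not the repository's exact kernel:

```c
#include <math.h>

// Sketch of a tiled matmul: out (B, T, OC) = inp (B, T, C) . weight (C, OC) + bias (OC).
// Here C stands for the input channel count (k*C in the text above).
void matmul_forward_tiled(float* out, const float* inp, const float* weight,
                          const float* bias, int B, int T, int C, int OC) {
    const int TILE = 8; // tile size along T and OC; tune for the cache
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t0 = 0; t0 < T; t0 += TILE) {
            for (int o0 = 0; o0 < OC; o0 += TILE) {
                int tmax = t0 + TILE < T ? t0 + TILE : T;
                int omax = o0 + TILE < OC ? o0 + TILE : OC;
                for (int t = t0; t < tmax; t++) {
                    const float* x = inp + (b * T + t) * C;
                    float* y = out + (b * T + t) * OC;
                    for (int o = o0; o < omax; o++) {
                        float acc = bias ? bias[o] : 0.0f;
                        for (int c = 0; c < C; c++) {
                            // fmaf compiles to a fused multiply-add where supported
                            acc = fmaf(x[c], weight[c * OC + o], acc);
                        }
                        y[o] = acc;
                    }
                }
            }
        }
    }
}
```

Compile with `-fopenmp -lm`. The tiling keeps small blocks of `out` and `weight` hot in cache, while OpenMP distributes the (batch, tile-row) pairs across threads.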
### Attention
...

For this layer, the input and output are of shape (B, T, Vp).
This layer is the last one of the model. Its purpose is to output understandable data. By applying a softmax function, we end up with an output of shape (B, T, Vp). Each token is given a normalized array of size Vp, whose values describe the probability of the token being a specific sequence of characters from the vocabulary. Note that any value of this array whose index is greater than V will be 0, as we have padded the vocabulary size for data alignment and not for any logic purpose.
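As a minimal sketch of this behaviour (the name and signature are assumptions; the repository's kernel may differ), the softmax can be computed over the real vocabulary V only, with the padded tail [V, Vp) written as exact zeros:

```c
#include <math.h>
#include <stddef.h>

// Sketch: softmax over the real vocabulary V only; the padded region
// [V, Vp) of each row is set to exactly 0. logits/probs are (B, T, Vp).
void softmax_forward(float* probs, const float* logits,
                     int B, int T, int V, int Vp) {
    for (int bt = 0; bt < B * T; bt++) {
        const float* in = logits + (size_t)bt * Vp;
        float* out = probs + (size_t)bt * Vp;
        // subtract the row max before exponentiating, for numerical stability
        float maxval = in[0];
        for (int i = 1; i < V; i++) { if (in[i] > maxval) maxval = in[i]; }
        float sum = 0.0f;
        for (int i = 0; i < V; i++) {
            out[i] = expf(in[i] - maxval);
            sum += out[i];
        }
        for (int i = 0; i < V; i++) { out[i] /= sum; }
        for (int i = V; i < Vp; i++) { out[i] = 0.0f; } // padded tail carries no probability
    }
}
```

Subtracting the row maximum before exponentiating is the usual trick to avoid overflow in `expf`.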
## Backward propagation
For backward propagation, we are using the _cross-entropy_ error function.
![63d2878f8b9a1378924ba771_Formula-9](uploads/7e9876747d85bdb9b53acba639a7186b/63d2878f8b9a1378924ba771_Formula-9.png)
Therefore, we are comparing the vector of probabilities given in the output of the softmax layer with the vector of known true probabilities (a vector of zeros and a single 1). Thus, by taking the derivative of the cross-entropy function, we can get our gradients to shift the model parameters. For more information about how backpropagation works, see https://en.wikipedia.org/wiki/Backpropagation.
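For reference, writing y for the one-hot target vector, p for the softmax probabilities and z for the logits, the loss pictured above and the standard softmax-plus-cross-entropy identity that the backward pass starts from are:

```math
L = -\sum_{i=1}^{V} y_i \log p_i, \qquad \frac{\partial L}{\partial z_i} = p_i - y_i
```

This identity is what makes the first step of the backward pass cheap: the gradient at the logits is just the predicted probabilities with 1 subtracted at the index of the true token.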
The optimization function used for the model update is [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html), with the following parameters (a sketch of the update rule follows the list):
* learning rate : 0.0001 (1e-4)
* beta1 : 0.9
* beta2 : 0.999
* epsilon : 1e-8
* weight decay : 0
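Below is a minimal sketch of one AdamW step over a flat parameter tensor using the hyperparameters above; the function name, signature, and moment buffers m and v are illustrative, not the repository's exact code:

```c
#include <math.h>

// Sketch of one AdamW step over a flat parameter array of length n.
// m and v are the first/second moment buffers, t is the 1-based step count.
void adamw_update(float* params, const float* grads, float* m, float* v,
                  long n, int t, float lr, float beta1, float beta2,
                  float eps, float weight_decay) {
    for (long i = 0; i < n; i++) {
        // exponentially decayed first and second moments of the gradient
        m[i] = beta1 * m[i] + (1.0f - beta1) * grads[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * grads[i] * grads[i];
        // bias-corrected estimates
        float m_hat = m[i] / (1.0f - powf(beta1, (float)t));
        float v_hat = v[i] / (1.0f - powf(beta2, (float)t));
        // Adam step plus decoupled weight decay (a no-op when weight_decay = 0)
        params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
    }
}
```

For the values listed above, this would be called as `adamw_update(params, grads, m, v, n, step, 1e-4f, 0.9f, 0.999f, 1e-8f, 0.0f)`; with weight decay 0 the decay term vanishes and the update reduces to plain Adam.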
# Model performance

## Sequential

...