# GPT2 Parallelization and porting
### Variables for backward propagation
In order to calculate the gradients for backpropagation, we have to keep a history of all the outputs of each layer. For layernorm and attention, we also need to keep a history of additional vectors computed inside these layers. All of these vectors are stored in the _acts_ field of the GPT2 model's data structure (a sketch of this structure is given below, after the list).
- **inputs** : Contains all the tokens for the current batch (B, T)
- **encoded** : Output of the positional_encoding layer (B, T, C)
- **ln1** : Output of the first layernorm inside the transformer block (L, B, T, C)
- **ln1_mean** : Mean computed inside the first layernorm (L, B, T)
- **ln1_rstd** : Reciprocal standard deviation computed inside the first layernorm (L, B, T)
- **qkv** : Output of the first linear layer in the attention layer (L, B, T, 3C)
- **atty** : Output of the attention function (L, B, T, C)
- **preatt** : Pre-softmax attention scores (L, B, NH, T, T)
- **att** : Attention weights after the softmax (L, B, NH, T, T)
- **attproj** : Output of the second linear layer in the attention layer (L, B, T, C)
- **residual2** : Output of the first residual layer in the transformer block (L, B, T, C)
- **ln2** : Output of the second layernorm in the transformer block (L, B, T, C)
- **ln2_mean** : Mean computed inside the second layernorm (L, B, T)
- **ln2_rstd** : Reciprocal standard deviation computed inside the second layernorm (L, B, T)
- **fch** : Output of the first linear layer of the MLP in the transformer block (L, B, T, 4C)
- **fch_gelu** : Output of the GeLU layer in the transformer block (L, B, T, 4C)
- **fcproj** : Output of the last linear layer in the transformer block (L, B, T, C)
- **residual3** : Output of the second residual layer in the transformer block (L, B, T, C)
**_NB : As mentioned in the previous part, the dimension L is present because we have L transformer layers. When passing these tensors to the layer functions, this extra dimension is removed._**
![GPT2-inputs](uploads/4f72849b1637d91dcbe216cc24e417d1/GPT2-inputs.png)
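For reference, here is a minimal sketch of such an activation structure in C, restricted to the fields listed above (the actual llm.c struct also holds buffers for the final layernorm, the logits, the probabilities and the losses):

```c
// Hedged sketch of the activation buffers listed above. Shapes are given as
// comments; every field is a pointer into one large float allocation.
typedef struct {
    float* encoded;    // (B, T, C)  output of the positional encoding
    float* ln1;        // (L, B, T, C)
    float* ln1_mean;   // (L, B, T)
    float* ln1_rstd;   // (L, B, T)
    float* qkv;        // (L, B, T, 3*C)
    float* atty;       // (L, B, T, C)
    float* preatt;     // (L, B, NH, T, T)
    float* att;        // (L, B, NH, T, T)
    float* attproj;    // (L, B, T, C)
    float* residual2;  // (L, B, T, C)
    float* ln2;        // (L, B, T, C)
    float* ln2_mean;   // (L, B, T)
    float* ln2_rstd;   // (L, B, T)
    float* fch;        // (L, B, T, 4*C)
    float* fch_gelu;   // (L, B, T, 4*C)
    float* fcproj;     // (L, B, T, C)
    float* residual3;  // (L, B, T, C)
} ActivationTensors;

// Inside the forward and backward passes, the leading L dimension is stripped
// by simple pointer arithmetic before calling the layer functions, e.g. for layer l:
//   float* l_ln1 = acts.ln1 + l * B * T * C;
```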
## Functions description
Here is the order in which the functions are called in the model.
![GPT2-functions](uploads/e313238307e971f0bebfad95be4c507e/GPT2-functions.png)
### Token embedding
First, let's focus on the shape of the data we receive. For each batch, we receive B (4) sequences of T (64) tokens. Therefore, we receive a matrix of shape (B, T) containing tokens. These tokens are integers whose values range between 0 and V (50,257).
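To make this concrete, here is a hedged sketch of the token + position embedding step that produces the _encoded_ activation, assuming _wte_ is the (V, C) token embedding table and _wpe_ the (maxT, C) position embedding table (llm.c's _encoder_forward_ follows this pattern):

```c
// For every token of the batch, add the embedding row of its token id (wte)
// and the embedding row of its position in the sequence (wpe).
// out: (B, T, C), inp: (B, T) token ids in [0, V), wte: (V, C), wpe: (maxT, C)
void encoder_forward(float* out, const int* inp,
                     const float* wte, const float* wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;
            const float* wte_ix = wte + inp[b * T + t] * C; // row of this token id
            const float* wpe_t  = wpe + t * C;              // row of position t
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_ix[c] + wpe_t[c];
            }
        }
    }
}
```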
The last step is to calculate the dot product between our multi-head attention matrix and the value vectors.
![Multi-headAttention11](uploads/cab96fcaece49c827f700814ce7300af/Multi-headAttention11.png)
To do this, we still need to take each head separately. We then compute the dot product between one head of the multi-head attention matrix and the matching value head vector of each attention head vector. Furthermore, as we are calculating the dot product between an array of size _1_ (the attention value for one head) and an array of size _hs_ (the value sub-array for one head), the output array will be of size _hs_. Also, as we are doing this for _NH_ heads, the global matrix at the end will be of size (NH\*hs, T, T) = (C, T, T). Moreover, we need to sum each row to itself, making an output matrix of shape (C, T).
Also, remember that we are processing batches, so this process applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, if we drop the representation we used for convenience, the actual output shape is (B, T, C) (we have just swapped two axes).
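Below is a hedged sketch of this accumulation step, assuming the packed (Q, K, V) layout of the _qkv_ buffer and writing the result into _atty_; it mirrors the last loop of llm.c's _attention_forward_, with hs = C / NH the per-head size:

```c
// For each token t and head h, accumulate the value vectors of all tokens
// t2 <= t, weighted by the post-softmax attention scores att[b,h,t,t2].
// atty: (B, T, C), att: (B, NH, T, T), qkv: (B, T, 3*C) with Q, K, V packed per token.
void attention_value_accumulate(float* atty, const float* att, const float* qkv,
                                int B, int T, int C, int NH) {
    int hs = C / NH; // per-head size
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < NH; h++) {
                const float* att_bth = att + ((b * NH + h) * T + t) * T; // (T,) attention row
                float* out_bth = atty + (b * T + t) * C + h * hs;        // (hs,) output slice
                for (int i = 0; i < hs; i++) { out_bth[i] = 0.0f; }
                for (int t2 = 0; t2 <= t; t2++) { // causal mask: past and current tokens only
                    const float* value_t2 = qkv + (b * T + t2) * 3 * C + 2 * C + h * hs;
                    for (int i = 0; i < hs; i++) {
                        out_bth[i] += att_bth[t2] * value_t2[i];
                    }
                }
            }
        }
    }
}
```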