# GPT2 Parallelization and porting

However, the model implemented in [the GitHub repository](https://github.com/karpathy/llm.c) differs from the original sketch in two ways:

+ The model's last layer is a softmax
+ The attention layer's function does not include the two linear layers that the sketch suggests; these two layers are calculated using matmul functions

Therefore, here is a rectified sketch of the model implemented: ![Adapted structure of GPT-2](uploads/3f9755fa7c4882bafcc5698cc2cfe636/gpt2-cleaned.png)
## Model's size
Here is the order in which the functions are called in the model. You will find, for each function, the input and output shapes and, after the //, the parameters it uses (a small shape-tracing sketch follows the list):

1. **function_name** _input_size -> output_size_ // _parameter1 parameter2..._
2. **encoder_forward** _(B,T) -> (B,T,C)_ // _wte wpe_
3. transformer layer repeated L times, composed of:
   1. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln1w ln1b_
   2. **matmul_forward** _(B,T,C) -> (B,T,3C)_ // _qkvw qkvb_
   3. **attention_forward** _(B,T,3C) -> (B,T,C)_
   4. **matmul_forward** _(B,T,C) -> (B,T,C)_ // _attprojw attprojb_
   5. **residual_forward** _(B,T,C) + (B,T,C) -> (B,T,C)_
   6. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _ln2w ln2b_
   7. **matmul_forward** _(B,T,C) -> (B,T,4C)_ // _fcw fcb_
   8. **gelu_forward** _(B,T,4C) -> (B,T,4C)_
   9. **matmul_forward** _(B,T,4C) -> (B,T,C)_ // _fcprojw fcprojb_
   10. **residual_forward** _(B,T,C) + (B,T,C) -> (B,T,C)_
4. **layernorm_forward** _(B,T,C) -> (B,T,C)_ // _lnfw lnfb_
5. **matmul_forward** _(B,T,C) -> (B,T,Vp)_ // _wte_
6. **softmax_forward** _(B,T,Vp) -> (B,T,Vp)_
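
To make this order easier to follow, here is a small standalone C sketch that simply walks through the same sequence of calls and prints the shape produced at each step. The dimensions are illustrative assumptions (B and T arbitrary, C=768, L=12 and the padded vocabulary Vp=50304 of GPT-2 small); this is only a trace of the data flow described above, not code taken from llm.c.

```c
#include <stdio.h>

int main(void) {
    // Illustrative dimensions: B and T are arbitrary, the rest are GPT-2 small values.
    int B = 4, T = 64;        // batch size, sequence length
    int C = 768, L = 12;      // channels, number of transformer layers
    int Vp = 50304;           // padded vocabulary size

    printf("encoder_forward   : (%d,%d) -> (%d,%d,%d)\n", B, T, B, T, C);
    for (int l = 0; l < L; l++) {
        printf("--- transformer layer %d ---\n", l);
        printf("layernorm_forward : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, C, B, T, C);
        printf("matmul_forward    : (%d,%d,%d) -> (%d,%d,%d)  // qkv\n", B, T, C, B, T, 3*C);
        printf("attention_forward : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, 3*C, B, T, C);
        printf("matmul_forward    : (%d,%d,%d) -> (%d,%d,%d)  // attention projection\n", B, T, C, B, T, C);
        printf("residual_forward  : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, C, B, T, C);
        printf("layernorm_forward : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, C, B, T, C);
        printf("matmul_forward    : (%d,%d,%d) -> (%d,%d,%d)  // fc\n", B, T, C, B, T, 4*C);
        printf("gelu_forward      : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, 4*C, B, T, 4*C);
        printf("matmul_forward    : (%d,%d,%d) -> (%d,%d,%d)  // fc projection\n", B, T, 4*C, B, T, C);
        printf("residual_forward  : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, C, B, T, C);
    }
    printf("layernorm_forward : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, C, B, T, C);
    printf("matmul_forward    : (%d,%d,%d) -> (%d,%d,%d)  // logits, reusing wte\n", B, T, C, B, T, Vp);
    printf("softmax_forward   : (%d,%d,%d) -> (%d,%d,%d)\n", B, T, Vp, B, T, Vp);
    return 0;
}
```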
### Token embedding
First, let's focus on the shape of the data we receive. For each batch, we receive B sequences of T tokens, so the input is of shape (B, T), where each element is a token value (an index into the vocabulary).
The first layer's role is to encode each token into an array, called a channel, that abstracts its meaning and its position in the sentence. Therefore, the layer encodes each token into a channel of size (C), making an output matrix of shape (B, T, C).
![Token-embedding](uploads/e36c23f83603a628b5e849e49fec41fb/Token-embedding.png)
This layer depends on two arrays of parameters:

+ **wte (V,C)**: to each token value, these weights assign a channel of float values; the channel for a given token is retrieved at wte\[token_value\]
+ **wpe (maxT,C)**: to each token position t in the sequence (of length T, up to the maximum sequence length maxT), these weights assign a channel; for a token at position t, an array of size C is retrieved at wpe\[t\]

**Conclusion**: for one token with value token_value at position t, the generated channel is wte\[token_value\] + wpe\[t\], as sketched in the code below.
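
In C, this boils down to adding one row of wte and one row of wpe for every (b, t) position. The sketch below follows the shapes above; it is close in spirit to llm.c's encoder_forward, though the exact code in the repository may differ slightly.

```c
// Token + position embedding: out[b,t,:] = wte[inp[b,t],:] + wpe[t,:]
// inp is (B,T) token values, wte is (V,C), wpe is (maxT,C), out is (B,T,C).
void encoder_forward(float* out, const int* inp,
                     const float* wte, const float* wpe,
                     int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;   // channel for token (b,t)
            int ix = inp[b * T + t];                 // token value at (b,t)
            const float* wte_ix = wte + ix * C;      // row wte[token_value]
            const float* wpe_t = wpe + t * C;        // row wpe[t]
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_ix[c] + wpe_t[c];
            }
        }
    }
}
```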
### Attention layer

Let's take Q from one token A (an array of shape C) and K from another token B (also an array of shape C).
Finishing later
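
Even though this subsection is still a work in progress, the operation being introduced can be sketched: the affinity between token A and token B is the dot product of A's query with B's key, scaled by the square root of the vector size (in the multi-head case the Q and K channels are split into NH heads of size C/NH). The helper below is purely illustrative and is not a function from llm.c.

```c
#include <math.h>

// Illustrative helper (not from llm.c): scaled dot-product affinity between
// one token's query vector q and another token's key vector k, both of size n
// (n = C in the single-vector view used above, or C/NH per head).
float attention_score(const float* q, const float* k, int n) {
    float dot = 0.0f;
    for (int i = 0; i < n; i++) {
        dot += q[i] * k[i];
    }
    return dot / sqrtf((float)n);  // scaling keeps the later softmax well-behaved
}
```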
### Residual layer
The purpose of this layer is to improve gradient backpropagation. To do so, we allow earlier activation matrices to be reintroduced into the current ones by adding the two together.
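
As a minimal sketch, the residual connection is just an element-wise addition of two activation tensors of the same shape (here N = B*T*C):

```c
// Residual connection: element-wise sum of the earlier activation (inp1)
// and the output of the current block (inp2). N = B*T*C elements in total.
void residual_forward(float* out, const float* inp1, const float* inp2, int N) {
    for (int i = 0; i < N; i++) {
        out[i] = inp1[i] + inp2[i];
    }
}
```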
### Gelu layer
This layer simply applies the Gelu function to all elements of the given input (of shape (B,T,4C)). The Gelu function is drawn just below. ![gelu](uploads/0c248ab94914bfc223093046744b59a8/gelu.png)
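
GPT-2 uses the tanh approximation of the Gelu function, applied independently to every element. A minimal sketch, with N = B*T*4C, could look like this:

```c
#include <math.h>

// Element-wise GELU using the tanh approximation used by GPT-2:
// gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
// N is the total number of elements, i.e. B*T*4C for this layer.
void gelu_forward(float* out, const float* inp, int N) {
    const float s = 0.7978845608f;  // sqrt(2/pi)
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(s * (x + cube)));
    }
}
```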
### Softmax layer
For this layer, input and output are of shape (B,T,Vp).

This layer is the last one of the model. Its purpose is to output understandable data. By applying a softmax function, we end up with an output of shape (B,T,Vp): to each token, a normalized array of size Vp is assigned, whose values describe the probability of the next token being a specific sequence of characters (a vocabulary entry). Note that any value of this array whose index is greater than or equal to V will be 0, as the vocabulary size is padded only for data alignment and not for any logic purpose.
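
Here is a sketch of how such a softmax over the padded vocabulary can be written: the maximum logit is subtracted for numerical stability, only the first V entries are normalized, and the padded tail is set to zero. It follows the simplified shapes used on this page and is close to what llm.c does, but it is a sketch rather than a copy of the repository's code.

```c
#include <math.h>

// Softmax over the padded vocabulary: only the first V entries of each
// (b,t) row hold real logits; the padded tail V..Vp-1 is written as 0 so
// it never receives any probability mass.
void softmax_forward(float* probs, const float* logits, int B, int T, int V, int Vp) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* logits_bt = logits + (b * T + t) * Vp;
            float* probs_bt = probs + (b * T + t) * Vp;
            // subtract the max logit before exponentiating, for numerical stability
            float maxval = logits_bt[0];
            for (int i = 1; i < V; i++) {
                if (logits_bt[i] > maxval) maxval = logits_bt[i];
            }
            float sum = 0.0f;
            for (int i = 0; i < V; i++) {
                probs_bt[i] = expf(logits_bt[i] - maxval);
                sum += probs_bt[i];
            }
            for (int i = 0; i < V; i++) {
                probs_bt[i] /= sum;
            }
            // padded indices get exactly zero probability
            for (int i = V; i < Vp; i++) {
                probs_bt[i] = 0.0f;
            }
        }
    }
}
```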
# Model performances