
# GPT2 Parallelization and porting

Updated Jun 21, 2024 by tchatela
However, the model implemented in [the GitHub repository](https://github.com/karpathy/llm.c) differs from this sketch on a few points:
+ The model's last layer is softmax
+ The attention layer's function does not include the two linear layers as the sketch suggests. These two layers are calculated using matmul functions
Therefore, here is a rectified sketch of the model implemented : ![Adapted structure of GPT-2](uploads/3f9755fa7c4882bafcc5698cc2cfe636/gpt2-cleaned.png)
## Model's size
First, let's focus on the shape of the data we receive. For each batch, we receive (B) sequences of (T) tokens, so the raw input is of shape (B, T).
The first layer's role is to encode each token into an array, called a channel, that abstracts its meaning and its position within the sentence. Therefore, the layer encodes each token into a channel of size (C), making an output matrix of shape (B, T, C).
![Token_embedding_1\_](uploads/ffb83174e0e50930f6533b52be569858/Token_embedding_1_.png)
![Token-embedding](uploads/e36c23f83603a628b5e849e49fec41fb/Token-embedding.png)
This layer will depend on two arrays of parameters:
+ **wte (V,C)** : To each token value, these weights assign a channel of float values. This array is retrieved at wte\[token_value\]
+ **wpe (maxT,C)** : To each token position in the sequence (the array of size T), these weights assign a channel, maxT being the maximum sequence length the model accepts. It means that for a token at position t in the array, an array of size C will be retrieved at wpe\[t\]
**Conclusion** : For one token, the array generated will be wte\[token_value\] + wpe\[t\]
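As an illustration, here is a minimal C sketch of this embedding step, assuming the tokens arrive as an int array of shape (B,T) and that wte and wpe are stored as row-major float arrays (the function name and exact signature are illustrative, not necessarily the repository's):

```c
// Sketch of the token + position embedding (illustrative helper).
// out: (B,T,C)   inp: (B,T) token ids   wte: (V,C)   wpe: (maxT,C)
void embedding_forward(float* out, const int* inp,
                       const float* wte, const float* wpe,
                       int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + (b * T + t) * C;           // channel for token (b,t)
            const float* wte_row = wte + inp[b * T + t] * C; // row selected by the token value
            const float* wpe_row = wpe + t * C;              // row selected by the position t
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_row[c] + wpe_row[c];         // wte[token_value] + wpe[t]
            }
        }
    }
}
```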
Let's take Q from one token A (an array of shape C) and K from another token B (also an array of shape C).
Finishing later
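As a rough illustration only (a sketch, not the repository's attention implementation), the raw attention score between token A and token B is the dot product of A's query vector and B's key vector:

```c
#include <math.h>

// Toy sketch: raw attention score between a query vector q (token A)
// and a key vector k (token B), both of length d.
// In the multi-head case, d would be the head size (C / NH), and the
// scores are scaled by 1/sqrt(d) before going through a softmax.
float attention_score(const float* q, const float* k, int d) {
    float dot = 0.0f;
    for (int i = 0; i < d; i++) {
        dot += q[i] * k[i];
    }
    return dot / sqrtf((float)d);
}
```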
#### In the code
### Residual layer
The purpose of this layer is to improve gradient backpropagation. To do so, we allow earlier data matrices to be reintroduced into the current matrices by adding them together (a skip connection).
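A minimal sketch of such a residual addition, assuming the activations are passed as flat buffers of N floats (signature illustrative):

```c
// Sketch of a residual (skip) connection: the earlier activations inp1 are
// added element-wise to the current activations inp2.
void residual_forward(float* out, const float* inp1, const float* inp2, int N) {
    for (int i = 0; i < N; i++) {
        out[i] = inp1[i] + inp2[i];
    }
}
```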
### Gelu layer
This layer simply applies the Gelu function to all elements of the given input (of shape B,T,4C). The Gelu function is drawn just below. ![gelu](uploads/0c248ab94914bfc223093046744b59a8/gelu.png)
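For illustration, here is a sketch of this element-wise operation using the common tanh approximation of Gelu (the variant used by GPT-2), applied to the whole (B,T,4C) input (signature illustrative):

```c
#include <math.h>

// Sketch of the element-wise Gelu (tanh approximation):
// gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
void gelu_forward(float* out, const float* inp, int N) {
    const float s = 0.7978845608f; // sqrt(2/pi)
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(s * (x + cube)));
    }
}
```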
### Softmax layer
For this layer, both input and output are of shape (B,T,Vp).
This layer is the last one of the model. Its purpose is to output understandable data. By applying a softmax function, we end up with an output of shape (B,T,Vp). To each token, a normalized array of size Vp is assigned. The values of this array describe the probability of the token being a specific sequence of characters from the vocabulary. Note that any value of this array whose index is V or greater will be 0, as we have padded the vocabulary size for data alignment and not for any logical purpose.
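For illustration, a simplified sketch of this final step (signature illustrative): for each (b,t) position, the V real logits are turned into probabilities and the padded entries from index V to Vp-1 are left at zero:

```c
#include <math.h>

// Sketch of the final softmax over the padded vocabulary.
// probs, logits: (B,T,Vp); only the first V entries per position are real.
void softmax_forward(float* probs, const float* logits, int B, int T, int V, int Vp) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* logits_bt = logits + (b * T + t) * Vp;
            float* probs_bt = probs + (b * T + t) * Vp;

            // subtract the max logit for numerical stability
            float maxval = -INFINITY;
            for (int i = 0; i < V; i++) {
                if (logits_bt[i] > maxval) maxval = logits_bt[i];
            }
            float sum = 0.0f;
            for (int i = 0; i < V; i++) {
                probs_bt[i] = expf(logits_bt[i] - maxval);
                sum += probs_bt[i];
            }
            for (int i = 0; i < V; i++) {
                probs_bt[i] /= sum;
            }
            // padded part of the vocabulary gets probability 0
            for (int i = V; i < Vp; i++) {
                probs_bt[i] = 0.0f;
            }
        }
    }
}
```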
# Model performances