Task Based Model · Changes

Update Task Based Model authored Jul 05, 2024 by tchatela's avatar tchatela
Task-Based-Model.md
View page @ 96dc6757
......@@ -10,14 +10,14 @@ First, we need to analyze what parts of the model can be overlapped. Therefore,
## Data dependency
From a general point of view, each layer of the forward pass needs its weights and biases to be updated before it can be computed.
As the backward pass generates the gradient values (necessary to update the weights) in the reverse order of the forward pass, we have to wait until the end of the backward pass to begin the next forward pass. In fact, the last layer computed in the backward pass is the encoder, which is the first layer of the forward pass.
The same logic applies to the backward pass: we need to wait for the forward pass to finish before beginning the backward pass.
With this coarse view of the model, it does not seem possible to create overlapping tasks. Therefore, we need to look at the model in more depth.
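The coarse dependency chain described above can be sketched as a small scheduling exercise. This is only an illustration under stated assumptions: every task costs one time unit, layer 0 stands for the encoder, and the task names (`fwd`, `bwd`, `upd`, `next_fwd`) as well as the `schedule` helper are hypothetical, not taken from llm.c.

```python
def schedule(num_layers):
    """Earliest (start, finish) time of each task, assuming unit cost per task.

    Hypothetical task names:
      ("fwd", l)      forward pass of layer l (layer 0 is the encoder)
      ("bwd", l)      backward pass of layer l
      ("upd", l)      weight/bias update of layer l
      ("next_fwd", l) forward pass of layer l in the following step
    """
    times = {}

    def add(task, deps):
        # A task starts as soon as all of its dependencies have finished.
        start = max((times[d][1] for d in deps), default=0)
        times[task] = (start, start + 1)

    last = num_layers - 1
    # Forward pass: each layer waits for the previous one.
    for l in range(num_layers):
        add(("fwd", l), [("fwd", l - 1)] if l > 0 else [])
    # Backward pass: reverse order, starting once the forward pass is done.
    for l in range(last, -1, -1):
        add(("bwd", l), [("bwd", l + 1)] if l < last else [("fwd", last)])
    # A layer can be updated once its gradients are available.
    for l in range(num_layers):
        add(("upd", l), [("bwd", l)])
    # Next forward pass: each layer needs its own updated weights.
    for l in range(num_layers):
        deps = [("upd", l)] + ([("next_fwd", l - 1)] if l > 0 else [])
        add(("next_fwd", l), deps)
    return times


times = schedule(4)
end_of_backward = max(finish for (kind, _), (_, finish) in times.items() if kind == "bwd")
print("backward pass ends at", end_of_backward)
print("next forward pass starts at", times[("next_fwd", 0)][0])
```

Because the encoder is the last layer handled by the backward pass and the first layer needed by the forward pass, the next forward pass cannot start before the whole backward pass has finished, whatever the number of layers.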
During a forward pass, nearly all of the calculations are channel-independent, as the attention layer is the only one that uses multiple channels in one calculation. Therefore, we can set up tasks that take k channels as input data, with k a divisor of T in \[1, T\], allowing us to overlap two different layers working on different sets of k channels. However, the attention layer uses all the tokens of a sentence for its computation. Thus, it is mandatory to wait for all the k-channel sets of a sentence before entering this layer (this is also why we choose k as a divisor of T in \[1, T\]). But as the attention layer only computes one sentence at a time, we can still compute k-channel sets from other sentences while the attention layer is running.
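As a rough sketch of this decomposition, the snippet below lists the admissible values of k and builds a task graph for one layer group. Several things here are assumptions made for illustration only: `B` denotes the number of sentences in a batch, the stages are collapsed into three hypothetical task kinds (`pre_attention`, `attention`, `post_attention`), and the helper names do not come from llm.c.

```python
def valid_chunk_sizes(T):
    """All k in [1, T] that divide T, so a sentence splits into equal sets."""
    return [k for k in range(1, T + 1) if T % k == 0]


def build_forward_tasks(B, T, k):
    """Map each task to the list of tasks it depends on.

    Hypothetical task names:
      ("pre_attention", b, start)   work on positions [start, start + k) of sentence b
      ("attention", b)              attention over the whole sentence b
      ("post_attention", b, start)  work on the same positions after attention
    """
    if T % k != 0:
        raise ValueError("k must be a divisor of T")
    tasks = {}
    for b in range(B):
        pre = []
        for start in range(0, T, k):
            name = ("pre_attention", b, start)
            tasks[name] = []  # independent of every other set
            pre.append(name)
        # Attention needs every set of its own sentence, and nothing else.
        attention = ("attention", b)
        tasks[attention] = list(pre)
        for start in range(0, T, k):
            tasks[("post_attention", b, start)] = [attention]
    return tasks


print(valid_chunk_sizes(8))
for task, deps in build_forward_tasks(B=2, T=8, k=4).items():
    print(task, "<-", deps)
```

In this graph the attention task of a sentence only depends on sets of that same sentence, so the sets of the other sentences remain free to run while it executes.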
During a backward pass, the principle of k-channels is a bit different, as we are not computing tokens anymore, but weights. However, we can still see
\ No newline at end of file
During a backward pass, the principle of k-channels is a bit different, as the shapes of the weights and biases do not depend on T. Basically, we can see this as k=1. However this is not problem
\ No newline at end of file