
GPT2 Parallelization and porting · Changes

Update GPT2 Parallelization and porting authored Jun 25, 2024 by tchatela
GPT2-Parallelization-and-porting.md
View page @ 4748c372
@@ -177,7 +177,7 @@ The principle of multi-head attention is to divide each Q,K,V into _head_s. As y
The last step is to do the dot product between our multi-head attention matrix and the value vectors.
![Multi-headAttention7](uploads/6caf9040941d8c182941d70d7fe215c5/Multi-headAttention7.png)
![Multi-headAttention11](uploads/cab96fcaece49c827f700814ce7300af/Multi-headAttention11.png)
To do this, we still need to take each head separately. We then compute the dot product between one head of the multi-head attention matrix and the matching value head vector. Furthermore, as we are doing the dot product between an array of size _1_ (the attention value for one head) and an array of size _hs_ (the value sub-array for one head), the output array will be of size _hs_. Also, as we are doing this for _NH_ heads, the global matrix at the end will be of size (NH\*hs, T, T) = (C, T, T). Moreover, we need to sum the elements of each row together, reducing one of the T axes and making an output matrix of shape (C, T). Also, remember that we are processing batches, so this process applies to every token sequence of each batch, making the output of the attention layer a matrix of shape (B, C, T). Finally, dropping the representation we used for convenience, the actual output shape is (B, T, C) (we have simply swapped two axes).
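To make the shapes concrete, here is a minimal C sketch of this last step (it is not the actual llm.c kernel; the layouts are assumptions for illustration): the attention weights are taken as (B, NH, T, T) with softmax already applied, the values as (B, T, C) stored head-by-head with hs = C / NH, and the output as (B, T, C). Each scalar attention weight multiplies a value slice of size hs, and the T weighted slices are summed per head.

```c
// Sketch of the attention * value step (illustrative layout, not the real llm.c code).
// att: (B, NH, T, T) softmaxed attention weights
// v:   (B, T, C)     value vectors, stored head-by-head (hs = C / NH)
// out: (B, T, C)     output of the attention layer
void attention_value_sketch(float* out, const float* att, const float* v,
                            int B, int T, int C, int NH) {
    int hs = C / NH; // head size
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < NH; h++) {
                // row of T attention weights for token t in head h
                const float* att_row = att + ((size_t)(b * NH + h) * T + t) * T;
                // output slice of size hs for token t in head h
                float* out_vec = out + (size_t)(b * T + t) * C + h * hs;
                for (int i = 0; i < hs; i++) out_vec[i] = 0.0f;
                // weighted sum over the T value slices: scalar weight * array of size hs
                for (int t2 = 0; t2 < T; t2++) {
                    const float* v_vec = v + (size_t)(b * T + t2) * C + h * hs;
                    float w = att_row[t2];
                    for (int i = 0; i < hs; i++) {
                        out_vec[i] += w * v_vec[i];
                    }
                }
            }
        }
    }
}
```

Writing it this way makes the summation over one T axis explicit: for each (b, t, h) the inner t2 loop collapses the (T, T) weights into a single vector of size hs, and concatenating the NH head slices gives the final (B, T, C) output described above.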
......