This wiki's goal is to link the GPT-2 model with its implementation in C. It also describes the different parallelization strategies used, and how these strategies impact the performance of the model.
# Description of the model
The Generative Pre-trained Transformer 2 (GPT-2) is a Large Language Model (LLM) introduced by OpenAI. Its distinguishing feature is that it is composed of many stacked Transformer layers, as shown below.
![Basic structure of GPT-2](https://www.researchgate.net/publication/373352176/figure/fig1/AS:11431281202501967@1698856108167/GPT-2-model-architecture-The-GPT-2-model-contains-N-Transformer-decoder-blocks-as-shown.ppm)
However, [our reference code](https://github.com/karpathy/llm.c) does not implement this model exactly, but a slightly different version. Here are the differences:
+ There are no dropout layers
+ The second residual connection takes the output of the masked multi-head attention, not the output of the normalization layer
+ The attention function does not include the two linear layers the sketch suggests; these two projections are computed with separate matmul calls
Therefore, here is a rectified sketch of the model implemented: