This wiki's goal is to link the GPT-2 model with its implementation in C. It also describes the different parallelization strategies used, and how these strategies impact the performance of the model.
# Description of the model
The Generative Pre-trained Transformer 2 (GPT-2) is a Large Language Model (LLM) introduced by OpenAI. Its distinguishing feature is that it is composed of many stacked Transformer layers, as shown below.
![Basic structure of GPT-2](https://www.researchgate.net/publication/373352176/figure/fig1/AS:11431281202501967@1698856108167/GPT-2-model-architecture-The-GPT-2-model-contains-N-Transformer-decoder-blocks-as-shown.ppm)
However, [our reference code](https://github.com/karpathy/llm.c) does not implement this model exactly, but a slightly different version. Here are the differences:
+ There are no dropout layers
+ The second residual connection takes the output of the masked multi-head attention, not the output of the normalization layer
+ The attention function does not include the two linear layers the sketch suggests; these two projections are computed with separate matmul calls
Therefore, here is a rectified sketch of the model implemented: