We use the following metrics to evaluate the model:
- Runtime per iteration : Time needed for one training iteration (forward + reset grads + backward + update)
- tokens/s : Number of computed tokens per second (= B*T / runtime_per_iteration, where B is the batch size and T the sequence length)
- tokens/(s.cpus) : Number of computed tokens per second per CPU (= (tokens/s) / num_cpus)
- Loss : Training loss of the model
- MFU : Model FLOPs utilization (achieved application GFLOPS / MN5 peak GFLOPS)
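
To make the definitions above concrete, here is a minimal sketch of how these quantities can be derived from one measured iteration. All names (`batch_size`, `seq_len`, `flops_per_iter`, `mn5_peak_gflops`, ...) are illustrative assumptions, not identifiers from the actual training code.

```python
# Minimal sketch of the metric computations (illustrative names only).
def compute_metrics(runtime_per_iter, loss, batch_size, seq_len, num_cpus,
                    flops_per_iter, mn5_peak_gflops):
    """runtime_per_iter: seconds for forward + grad reset + backward + update.
    flops_per_iter: estimated FLOPs the application performs per iteration.
    mn5_peak_gflops: assumed peak GFLOPS of the MN5 resources used."""
    tokens_per_s = batch_size * seq_len / runtime_per_iter       # tokens/s = B*T / runtime
    tokens_per_s_cpu = tokens_per_s / num_cpus                   # tokens/(s.cpus)
    achieved_gflops = flops_per_iter / runtime_per_iter / 1e9    # application GFLOPS
    mfu = achieved_gflops / mn5_peak_gflops                      # model FLOPs utilization
    return {
        "runtime_per_iteration": runtime_per_iter,
        "tokens/s": tokens_per_s,
        "tokens/(s.cpus)": tokens_per_s_cpu,
        "loss": loss,
        "MFU": mfu,
    }
```
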
No matter how we parallelize the model, we should always get the same training losses, since the model is deterministic.
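
A quick way to check this, assuming each run writes one loss value per line to a text file (the file names below are hypothetical), is to compare the loss curves of two runs with different CPU counts:

```python
import numpy as np

# Hypothetical loss logs (one loss per line, one file per run).
losses_run_a = np.loadtxt("losses_1cpu.txt")
losses_run_b = np.loadtxt("losses_64cpu.txt")

# With a deterministic model, both curves should match up to floating-point noise.
assert losses_run_a.shape == losses_run_b.shape
assert np.allclose(losses_run_a, losses_run_b, rtol=1e-6, atol=1e-8), \
    "Training losses differ between the two parallel configurations"
print("Loss curves match across configurations.")
```
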
The GPU version reports the same metrics, so we can compare the tokens/s values directly.
For now, it is not clear whether we can rely on the MFU value.