Kernel test
-----------

The test is a single kernel execution on one node, to verify that the execution times are acceptable to compute quality coefficients. To launch the kernels script, the cluster queue manager, in this case SLURM, will be used.

The `srun` command is required, or `mpirun` bootstrapped with SLURM. Open the script `scripts/learning/helpers/kernels_executor.sh` and look at the function `launching()`. Check that the SLURM command is correct to launch the test on your supercomputer.

If the kernel executions take between 60 and 120 seconds of elapsed time, that is considered enough to compute quality coefficients. If not, it is recommended to tweak the kernels' behaviour.
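
As a quick sanity check, the elapsed-time window can be verified with a small wrapper like the sketch below. The kernel binary name is a placeholder, and on a real cluster the run would go through `srun`; here `sleep 1` stands in for the actual launch.

```shell
# Hedged sketch: time one kernel run and check it falls in the 60-120 s window.
# "sleep 1" stands in for a real launch such as: srun -N 1 ./bt-mz.C.x
start=$(date +%s)
sleep 1
elapsed=$(( $(date +%s) - start ))
if [ "$elapsed" -ge 60 ] && [ "$elapsed" -le 120 ]; then
    echo "kernel timing OK (${elapsed}s)"
else
    echo "timing out of range (${elapsed}s): adjust the kernel class or iterations"
fi
```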

Kernel customization
--------------------

If, after launching a learning-phase kernel at P_STATE 1, the elapsed time (in seconds) is between 60 and 120, it is a good-quality kernel. If some benchmarks are not within these times, you can increase or decrease the class letter at compilation time.
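
Changing the class letter is normally done through the NPB make variables at compile time; the target name, class and process count below are illustrative assumptions, so verify them against your NPB tree before building.

```shell
# Hypothetical rebuild with a bigger class; run inside the NPB source tree.
# CLASS and NPROCS are the usual NPB make variables (assumption).
cmd="make bt-mz CLASS=D NPROCS=28"
echo "$cmd"   # drop the echo indirection to actually build
```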

If no letter can adjust the kernels to your node, you can go to each kernel configuration file and change the values summarized in the following table:

| Kernel | File | Function | Var |
| ------ | ---- | -------- | --- |
| bt-mz | NPB3.3.1-MZ/NPB3.3-MZ-MPI/sys/setparams.c | write_bt_info | niter |
| lu-mz | NPB3.3.1-MZ/NPB3.3-MZ-MPI/sys/setparams.c | write_lu_info | itmax |
| sp-mz | NPB3.3.1-MZ/NPB3.3-MZ-MPI/sys/setparams.c | write_sp_info | niter |
| ep | NPB3.3.1/NPB3.3-MPI/sys/setparams.c | write_ep_info | m |
| lu | NPB3.3.1/NPB3.3-MPI/sys/setparams.c | write_lu_info | itmax |
| ua | NPB3.3.1/NPB3.3-OMP/sys/setparams.c | write_ua_info | niter |

Depending on your system, you may have to increase or decrease its value. As a reference, here is a table containing the class letter for the script and the value of the variable for a couple of CPU architectures:

| Kernel | Haswell | Skylake |
| ------ | ------- | ------- |
| bt-mz | C / 200 | C / 600 |
| lu-mz | C / 250 | C / 800 |
| sp-mz | C / 800 | C / 2000 |
| ep | D / 36 | D / 37 |
| lu | C / 250 | C / 750 |
| ua | C / 200 | C / 200 |
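
As a sketch of the kind of edit involved, the following bumps the bt-mz `niter` value to the Skylake reference from the table. The `niter = <number>` pattern is an assumption about how the assignment appears inside `setparams.c`, so check your copy first; the commands below run on a stand-in file rather than the real `NPB3.3.1-MZ/NPB3.3-MZ-MPI/sys/setparams.c`.

```shell
# Demo file standing in for NPB3.3.1-MZ/NPB3.3-MZ-MPI/sys/setparams.c
printf 'niter = 200;\n' > setparams_demo.c
# Raise the iteration count to the Skylake reference value from the table above
sed -i 's/niter = [0-9]*/niter = 600/' setparams_demo.c
cat setparams_demo.c
```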

For `dgemm`, you have to edit the file `dgemm_example.f`. Take a look at the PARAMETER variable definition in the first line, which sets the sizes of the matrices to multiply. Increase or decrease those values equally, depending on whether you want to add or subtract computing time.
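
A similar sketch for `dgemm`, doubling all three matrix dimensions. The exact `PARAMETER (M=..., K=..., N=...)` line is an assumption about `dgemm_example.f`, and the edit below runs on a stand-in file.

```shell
# Stand-in for the first PARAMETER line of dgemm_example.f (assumed format)
echo '      PARAMETER (M=2000, K=200, N=1000)' > dgemm_demo.f
# Double every dimension; note this multiplies the floating-point work by ~8
sed -i 's/M=2000, K=200, N=1000/M=4000, K=400, N=2000/' dgemm_demo.f
cat dgemm_demo.f
```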

Once the customization is done, you have to run your customized kernels again to complete the learning phase. It is also recommended to clean the customized kernels' records from your database.

Step 4, coefficients computing
------------------------------

Once all the kernels have been launched at the different frequencies (or P_STATEs), the coefficients have to be computed using the installed binary `/bin/compute_coefficients`.

This binary computes the coefficients and stores the resulting file in the location specified by the configuration file `ear.conf`. There is one file per node, so the binary has to be run once per node, on a node of the same hardware architecture, because it checks the range of P_STATEs.

The path of the coefficients, the nominal frequency of the node and also the node name have to be passed to correctly compute the coefficients. If the node name is not present, the binary will use its host name.

This is an example:

`./compute_coefficients /etc/ear/coeffs 2400000 node1001`

Remember to load the EAR module, which specifies the location of the `ear.conf` configuration file.
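
Putting the pieces together, a per-node loop might look like the sketch below. The host list file name, the frequency and the paths are assumptions, and the generated commands are only printed for review, not executed.

```shell
# Example host list (assumption: one node name per line)
printf 'node1001\nnode1002\n' > hostlist.txt
# module load ear   # uncomment on the cluster so ear.conf is found
while read -r node; do
    echo "./compute_coefficients /etc/ear/coeffs 2400000 $node"
done < hostlist.txt > coeff_cmds.txt
cat coeff_cmds.txt   # review, then run the lines (or drop the echo indirection)
```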

Automatized kernels compilation script
--------------------------------------

A set of scripts is provided to speed things up, with minimal editing required. These files are placed in the `scripts/learning` folder of your EAR installation folder.

The compiling script is located in `scripts/learning/kernels_compile.sh`. Before executing it, you have to perform some adjustments:

1) Open `kernels_compile.sh` and look for these lines:

```
# Edit architecture values
export CORES=28
export SOCKETS=2
export CORES_PER_SOCKET=14
```

2) Update the following parameters:

- **CORES**: the total number of cores in a single computing node.
- **SOCKETS**: the total number of sockets in a single computing node.
- **CORES_PER_SOCKET**: the total number of cores per socket in a single computing node.

3) Launch the compiling phase by typing `./kernels_compile.sh` in your compile node.
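
Instead of hard-coding the three values, they can often be derived from `lscpu`. The field names below assume util-linux output, with the script defaults kept as a fallback when `lscpu` is unavailable.

```shell
# Query the topology (field names assume util-linux lscpu output)
SOCKETS=$(lscpu 2>/dev/null | awk -F: '/^Socket\(s\)/ {gsub(/ /, ""); print $2}')
CORES_PER_SOCKET=$(lscpu 2>/dev/null | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, ""); print $2}')
export SOCKETS=${SOCKETS:-2}                       # fall back to the script defaults
export CORES_PER_SOCKET=${CORES_PER_SOCKET:-14}
export CORES=$((SOCKETS * CORES_PER_SOCKET))       # total cores = sockets x cores/socket
echo "CORES=$CORES SOCKETS=$SOCKETS CORES_PER_SOCKET=$CORES_PER_SOCKET"
```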

You can also easily customize your kernels by adjusting the script located in `scripts/learning/helpers/kernels_iterator.sh`. For example, if you want to increase the execution time of a kernel compiled with class letter C, switch it to D. Or, if you want to decrease the execution time of a kernel compiled with class letter B, switch the letter to A. Then compile and execute again.

You can see what to edit in the following example:

```
learning-phase lu-mpi C
learning-phase ep D
learning-phase bt-mz C
learning-phase sp-mz C
learning-phase lu-mz C
learning-phase ua C
learning-phase dgemm
learning-phase stream
```

As you can see, there are no class letters for the `dgemm` or `stream` kernels. STREAM is a well-known benchmark and needs no manual modification, because it varies its behaviour by itself. For `dgemm`, or for a class-letter benchmark that does not fit your goals, a manual kernel modification is recommended.

Automatized kernels execution script
------------------------------------

Next to the kernels compilation script, the executing version is also provided. With the kernels compiled, installed and tested, you are ready to execute the learning phase.

Before that, you have to perform some adjustments:

1) Open the script `scripts/learning/kernels_learn.sh`.

2) Look at these lines:

```
# Edit architecture values
export CORES=28
export SOCKETS=2
export CORES_PER_SOCKET=14

# Edit learning phase parameters
export EAR_MIN_P_STATE=1
export EAR_MAX_P_STATE=6
```

3) Update the following parameters:

- **CORES**: the total number of cores in a single computing node.
- **SOCKETS**: the total number of sockets in a single computing node.
- **CORES_PER_SOCKET**: the total number of cores per socket in a single computing node.
- **EAR_MIN_P_STATE**: defines the maximum frequency to set during the learning phase. The default value is 1, meaning that the nominal frequency will be the maximum frequency that your cluster nodes will set. In the current version of EAR, turbo support is not included.
- **EAR_MAX_P_STATE**: defines the minimum frequency to test during the learning phase. If it is set to 6 and EAR_MIN_P_STATE is 1, six frequencies will be tested during the learning phase, from P_STATE 1 to 6. This set of frequencies has to match the set of frequencies that your cluster nodes are able to set during compute time.

4) Edit the execution command located in `scripts/learning/helpers/kernels_executor.sh`, in the function `launching_slurm()`. By default it uses the `srun` command, but you can switch it to another one, like `mpirun`. Just translate the written command to your launcher's syntax.

5) Execute the learning phase on all of your nodes by typing a command like `./kernels_learn.sh <hostlist>`, passing the path of a file containing the list of nodes where you want to perform the learning phase. An `sbatch` job will be launched exclusively on every node, performing a series of `srun` executions of the kernels on the same node.

6) Execute the coefficients computing binary by typing `./coeffs_compute.sh <hostlist>` on a node which shares the same architecture (or at least the P_STATEs list) as the nodes that completed the learning phase.

7) Check that the correct number of coefficient files is present in the selected coefficients installation path.
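
That final count check can be scripted. The coefficients directory and the file naming are assumptions (here mocked with empty files), so adapt the path and pattern to your `ear.conf` settings.

```shell
# Mock coefficients directory standing in for the real installation path
mkdir -p coeffs_demo
touch coeffs_demo/coeffs.node1001 coeffs_demo/coeffs.node1002
printf 'node1001\nnode1002\n' > hostlist_demo.txt
# One coefficients file is expected per node in the host list
expected=$(wc -l < hostlist_demo.txt)
found=$(ls coeffs_demo | wc -l)
if [ "$found" -eq "$expected" ]; then
    echo "coefficients complete ($found/$expected)"
else
    echo "missing coefficients ($found/$expected)"
fi
```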