|
|
Setting up
----------
|
|
|
The test script launches every kernel once to check whether the elapsed times are long enough to compute quality coefficients.
|
|
|
|
|
|
1) The script is launched through SLURM, so `sbatch` is required, together with either `srun` or `mpirun` bootstrapping through SLURM. Open the script `scripts/learning/helpers/kernels_executor.sh` and look at the function `launching()`. Check that the SLURM command is correct for running the test on your supercomputer. These are two examples of the `srun` and `mpirun` commands for a node with 40 CPUs:
|
|
|
|
|
|
`srun -N 1 -n 40 -J "bt-mz" -w node1001 --ear-policy=MONITORING_ONLY --ear-learning=1 /installation.path/bin/kernels/bt-mz.C.40`
|
|
|
|
|
|
`mpirun -n 40 -bootstrap slurm -bootstrap-exec-args="-J 'bt-mz' -w node1001 --ear-verbose=1 --ear-learning=1" /installation.path/bin/kernels/bt-mz.C.40`
|
|
|
|
|
|
2) Edit `scripts/learning/kernels_test.sh` and configure the cores/sockets options as described in the setup section of the [compile guide](1-˗-Kernels-compile). After that, write a file with a list of nodes. This test requires just one node in the list, but if your cluster has multiple architectures, it is recommended to write one list, containing one node, per architecture.
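As a minimal sketch, a single-node host list could be written as shown here. The one-name-per-line layout, the helper name `write_hostlist` and the temporary file location are assumptions made for illustration; `node1001` is the node name used in the command examples above.

```shell
#!/bin/bash
# Hypothetical helper: write the given node names to a file, one per line.
# The one-name-per-line layout is an assumption, not a documented format.
write_hostlist() {
    local file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Example: a list with a single node, stored in a temporary file.
hostlist="$(mktemp)"
write_hostlist "$hostlist" node1001
cat "$hostlist"
```

On a cluster with several architectures, the same helper could be called once per architecture to produce one file for each.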
|
|
|
|
|
|
Kernel test
-----------
|
|
|
The test is a single kernel execution on one node, used to check that the execution times are long enough to compute quality coefficients. The kernels script is launched through the cluster queue manager, in this case SLURM.
|
|
|
|
|
|
Either the `srun` command or `mpirun` bootstrapping with SLURM is required. Open the script `scripts/learning/helpers/kernels_executor.sh` and look at the function `launching()`. Check that the SLURM command is correct for launching the test on your supercomputer.
|
|
|
Run the test `scripts/learning/kernels_test.sh hostlist` for each test hostlist.
|
|
|
|
|
|
If the kernel executions take between 60 and 120 seconds of elapsed time, that is considered enough to compute quality coefficients. If not, it is recommended to tweak the kernels' behaviour.
|
|
|
A run time between 60 and 120 seconds per kernel is considered enough. If not, it is recommended to tweak the kernels' behaviour to increase or decrease the running time.
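The 60 to 120 second rule can be sketched as a small check. The function name `classify_elapsed` is hypothetical, it expects the elapsed time in whole seconds, and how that value is obtained from the test output is not covered here.

```shell
#!/bin/bash
# Hypothetical helper: classify a kernel's elapsed time (whole seconds)
# against the 60-120 second window described above.
classify_elapsed() {
    local seconds="$1"
    if [ "$seconds" -lt 60 ]; then
        echo "too short"   # consider a larger class letter
    elif [ "$seconds" -gt 120 ]; then
        echo "too long"    # consider a smaller class letter
    else
        echo "ok"
    fi
}

classify_elapsed 90
```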
|
|
|
|
|
|
Kernel customization
|
|
|
Fast kernel tweaking
--------------------
|
|
|
If a learning phase kernel launched at P_STATE 1 has an elapsed time between 60 and 120 seconds, it is a good quality kernel. If some benchmarks fall outside this range, you can increase or decrease the class letter at compilation time.
|
|
|
You can easily customize your kernels by adjusting the script located in `scripts/learning/helpers/kernels_iterator.sh`. For example, to increase the execution time of a kernel compiled with class letter C, switch it to D. To decrease the execution time of a kernel compiled with class letter B, switch the letter to A.
|
|
|
|
|
|
If no letter can adjust the kernels to your node, you can go through each kernel's configuration file and change the values summarized in the following table:
|
|
|
You can see where to edit in the following example:
|
|
|
```
learning-phase lu-mpi C
learning-phase ep D
learning-phase bt-mz C
learning-phase sp-mz C
learning-phase lu-mz C
learning-phase ua C
learning-phase dgemm
learning-phase stream
```
|
|
|
|
|
|
There are no class letters for the `dgemm` or `stream` kernels. `stream` is a well-known benchmark and needs no manual modification because it varies its behavior by itself. For `dgemm`, or for a class letter benchmark that doesn't fit your goals, a manual kernel modification is recommended.
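Switching a class letter can also be scripted. The sketch below assumes that the lines in `kernels_iterator.sh` look exactly like the `learning-phase <kernel> <letter>` lines in the example; the function name `set_class_letter` is hypothetical, and the demonstration works on a temporary copy rather than on the real script.

```shell
#!/bin/bash
# Hypothetical helper: change the class letter of one kernel in a file
# containing lines such as "learning-phase bt-mz C".
# Lines without a class letter (dgemm, stream) are left untouched.
set_class_letter() {
    local file="$1" kernel="$2" letter="$3"
    # -i.bak keeps a backup and works with both GNU and BSD sed.
    sed -i.bak -E \
        "s/^([[:space:]]*learning-phase[[:space:]]+${kernel})[[:space:]]+[A-Z][[:space:]]*\$/\1 ${letter}/" \
        "$file"
}

# Demonstration on a temporary file that mimics the example above.
demo="$(mktemp)"
printf '%s\n' \
    "learning-phase bt-mz C" \
    "learning-phase lu-mz C" \
    "learning-phase dgemm" > "$demo"

set_class_letter "$demo" bt-mz D
cat "$demo"
```

After editing the real script, the kernels still have to be compiled and tested again for the change to take effect.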
|
|
|
|
|
|
Finally [compile](1-˗-Kernels-compile) and test again.
|
|
|
|
|
|
Deep kernel customization
-------------------------
|
|
|
In addition to the fast tweaking, if no letter can adjust the kernels to your node properties, you can open the kernels' configuration files and edit the values summarized in the following table:
|
|
|
|
|
|
| Kernel   | File                                      | Function      | Var   |
| -------- | ----------------------------------------- | ------------- | ----- |
|
... | ... | @@ -21,7 +49,7 @@ If no letter can adjust the kernels to your node, you can surf to every kernel c |
|
|
| lu       | NPB3.3.1/NPB3.3-MPI/sys/setparams.c       | write_lu_info | itmax |
| ua       | NPB3.3.1/NPB3.3-OMP/sys/setparams.c       | write_ua_info | niter |
|
|
|
|
|
|
Depending on your system, you have to increase or decrease its value. As a reference, the following table provides the script letter and the value of the variable for a couple of CPU architectures:
|
|
|
Depending on your system, increase or decrease these values. As a reference, the following table provides the script letter and the value of the variable for a couple of CPU architectures:
|
|
|
|
|
|
| Kernel   | Haswell      | Skylake      |
| -------- | ------------ | ------------ |
|
... | ... | @@ -32,81 +60,6 @@ Depending on your system you have to increase or decrease its value. As a refere |
|
|
| lu       | C / 250      | C / 750      |
| ua       | C / 200      | C / 200      |
|
|
|
|
|
|
For `dgemm`, you have to edit the file `dgemm_example.f`. Take a look at the PARAMETER variable definition in the first line, which sets the size of the computing matrix. Increase or decrease those values equally, depending on whether you want to add or subtract computing time.
|
|
|
|
|
|
Once the customization is done, you have to run your customized kernels again to complete the learning phase. It is also recommended to clean the records of the customized kernels from your database.
|
|
|
|
|
|
Step 4, coefficients computing
------------------------------
|
|
|
Once all the kernels have been launched at the different frequencies (or P_STATEs), the coefficients have to be computed using the installed binary `/bin/compute_coefficients`.
|
|
|
|
|
|
This binary computes the coefficients and stores the file in the location specified by the configuration file `ear.conf`. There is just one file per node, so the binary has to be run once per node, on a node of the same hardware architecture, because it checks the range of P_STATEs.
|
|
|
|
|
|
The path of the coefficients, the nominal frequency of the node and the node name have to be passed to compute the coefficients correctly. If the node name is not given, the binary uses the host name.
|
|
|
For `dgemm`, edit the file `dgemm_example.f`. Take a look at the PARAMETER variable definition in the first line, which sets the size of the computing matrix. Increase or decrease those values equally, depending on whether you want to add or subtract computing time.
|
|
|
|
|
|
This is an example:
|
|
|
`./compute_coefficients /etc/ear/coeffs 2400000 node1001`
|
|
|
|
|
|
Remember to load the EAR module, which specifies the location of the `ear.conf` configuration file.
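Because the binary runs once per node, the invocations for a whole host list can be generated in a loop. The following is a dry-run sketch that only prints the commands: the function name `print_coefficient_commands` is hypothetical, the host list is assumed to contain one node name per line, and the argument order (coefficients path, nominal frequency, node name) is taken from the example above.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print one compute_coefficients command per
# node listed in a host list file (assumed layout: one node name per line).
print_coefficient_commands() {
    local coeffs_path="$1" nominal_freq="$2" hostlist="$3" node
    while IFS= read -r node; do
        [ -n "$node" ] || continue   # skip empty lines
        echo "./compute_coefficients $coeffs_path $nominal_freq $node"
    done < "$hostlist"
}

# Demonstration with a temporary host list; node1002 is a made-up name.
nodes="$(mktemp)"
printf '%s\n' node1001 node1002 > "$nodes"
print_coefficient_commands /etc/ear/coeffs 2400000 "$nodes"
```

The printed commands still have to be executed on a node of the matching hardware architecture, with the EAR module loaded.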
|
|
|
|
|
|
Automatized kernels compilation script
--------------------------------------
|
|
|
A set of scripts is provided to speed up the process with minimal editing. These files are placed in the `scripts/learning` folder of your EAR installation folder.
|
|
|
|
|
|
The compiling script is located in `scripts/learning/kernels_compile.sh`. Before executing it, you have to make some adjustments:
|
|
|
1) Open `kernels_compile.sh` and look for these lines:
|
|
|
```
# Edit architecture values
export CORES=28
export SOCKETS=2
export CORES_PER_SOCKET=14
```
|
|
|
2) Update the following parameters:<br />
|
|
|
- **CORES**: the total number of cores in a single computing node.<br />
|
|
|
- **SOCKETS**: the total number of sockets in a single computing node.<br />
|
|
|
- **CORES_PER_SOCKET**: the total number of cores per socket in a single computing node.<br />
|
|
|
3) Launch the compiling phase by typing `./kernels_compile.sh` on your compile node.
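Before launching the compilation, the three architecture values can be checked for consistency. This sketch assumes that every socket has the same number of cores, so that `CORES` equals `SOCKETS` times `CORES_PER_SOCKET`, as in the default values (28 = 2 x 14); the function name `check_topology` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if CORES = SOCKETS * CORES_PER_SOCKET.
# Assumes all sockets of the node have the same number of cores.
check_topology() {
    local cores="$1" sockets="$2" cores_per_socket="$3"
    [ "$cores" -eq $(( sockets * cores_per_socket )) ]
}

# Values taken from the default block in kernels_compile.sh.
CORES=28
SOCKETS=2
CORES_PER_SOCKET=14

if check_topology "$CORES" "$SOCKETS" "$CORES_PER_SOCKET"; then
    echo "architecture values are consistent"
else
    echo "architecture values do not match" >&2
fi
```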
|
|
|
|
|
|
You can also easily customize your kernels by adjusting the script located in `scripts/learning/helpers/kernels_iterator.sh`. For example, to increase the execution time of a kernel compiled with class letter C, switch it to D. To decrease the execution time of a kernel compiled with class letter B, switch the letter to A. Then compile and execute again.
|
|
|
|
|
|
You can see where to edit in the following example:
|
|
|
```
learning-phase lu-mpi C
learning-phase ep D
learning-phase bt-mz C
learning-phase sp-mz C
learning-phase lu-mz C
learning-phase ua C
learning-phase dgemm
learning-phase stream
```
|
|
|
|
|
|
As you can see, there are no class letters for the `dgemm` or `stream` kernels. `stream` is a well-known benchmark and needs no manual modification because it varies its behavior by itself. For `dgemm`, or for a class letter benchmark that doesn't fit your goals, a manual kernel modification is recommended.
|
|
|
|
|
|
Automatized kernels execution script
------------------------------------
|
|
|
Alongside the kernels compilation script, an execution script is also provided. Once the kernels are compiled, installed and tested, you are ready to execute the learning phase.
|
|
|
|
|
|
Before that, you have to perform some adjustments:
|
|
|
1) Open the script `scripts/learning/kernels_learn.sh`.
|
|
|
2) Look at these lines
|
|
|
```
# Edit architecture values
export CORES=28
export SOCKETS=2
export CORES_PER_SOCKET=14

# Edit learning phase parameters
export EAR_MIN_P_STATE=1
export EAR_MAX_P_STATE=6
```
|
|
|
3) Update the following parameters:<br />
|
|
|
- **CORES**: the total number of cores in a single computing node.<br />
|
|
|
- **SOCKETS**: the total number of sockets in a single computing node.<br />
|
|
|
- **CORES_PER_SOCKET**: the total number of cores per socket in a single computing node.<br />
|
|
|
- **EAR_MIN_P_STATE**: defines the maximum frequency to set during the learning phase. The default value is 1, meaning that the nominal frequency is the maximum frequency that your cluster nodes will set. Turbo support is not included in the current version of EAR.<br />
|
|
|
- **EAR_MAX_P_STATE**: defines the minimum frequency to test during the learning phase. If it is set to 6 and EAR_MIN_P_STATE is 1, 6 frequencies will be set during the learning phase, from P_STATE 1 to P_STATE 6. This set of frequencies has to match the set of frequencies that your cluster nodes are able to set during computing time.<br />
|
|
|
4) Edit the execution command located in `scripts/learning/helpers/kernels_executor.sh`, in the function `launching_slurm()`. By default it uses the `srun` command, but you can switch it to another one, such as `mpirun`. Translate the written command into the equivalent for your launcher.
|
|
|
5) Execute the learning phase on all of your nodes by typing a command like `./kernels_learn.sh <hostlist>`, passing the path of a file containing the list of nodes where you want to perform the learning phase. An `sbatch` job will be launched exclusively on every node, running a series of `srun` executions of the kernels on that node.
|
|
|
6) Compute the coefficients by typing `./coeffs_compute.sh <hostlist>` on a node which shares the same architecture (or at least the same P_STATEs list) as the nodes of the completed learning phase.
|
|
|
7) Check that the selected coefficients installation path contains the correct number of coefficients.
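The relation between the two P_STATE parameters in step 3 can be sketched as simple arithmetic: the learning phase covers every P_STATE from `EAR_MIN_P_STATE` to `EAR_MAX_P_STATE`, both included. The function name `pstate_count` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: number of P_STATEs (frequencies) covered by the
# learning phase, counting both ends of the range.
pstate_count() {
    local min_pstate="$1" max_pstate="$2"
    echo $(( max_pstate - min_pstate + 1 ))
}

# Default values from kernels_learn.sh: P_STATEs 1 to 6.
EAR_MIN_P_STATE=1
EAR_MAX_P_STATE=6
echo "frequencies to test: $(pstate_count "$EAR_MIN_P_STATE" "$EAR_MAX_P_STATE")"
```

With the default values this gives 6, which matches the description of `EAR_MAX_P_STATE` in step 3.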
|
|
\ No newline at end of file |
|
|
Once the customization is done, you have to [compile](1-˗-Kernels-compile) and test again.
|
|
\ No newline at end of file |