[[_TOC_]]

# Introduction

EAR offers a set of environment variables that let users tune or request some EAR features.
They must be exported before the job is submitted, e.g., in the batch script.

The current EAR version supports the [SLURM](https://slurm.schedmd.com/),
[PBS](https://www.altair.com/pbs-professional/) and [OAR](https://oar.imag.fr/start)
batch schedulers.
In SLURM systems, the scheduler may filter out environment variables that are not prefixed with *SLURM_* (this happens when the batch script is submitted with all environment variables purged in order to work in a clean environment).
For that reason, EAR environment variables were originally designed with names of the form *SLURM_*\<variable_name\>.

Now that EAR supports other batch schedulers, and in order to keep environment variable names coherent, the environment variables below need the prefix of the scheduler used on the system where the job is submitted, plus an underscore.
For example, in SLURM systems, the environment variable presented as `EAR_LOADER_APPLICATION` must be exported as `SLURM_EAR_LOADER_APPLICATION` in the submission batch script. On a system using OAR, this variable would be exported as `OAR_EAR_LOADER_APPLICATION`.
This design may only have a real effect on SLURM systems, but it makes it easier for the development team to support multiple batch schedulers.

All examples showing the usage of the environment variables below assume a system using SLURM.
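
The prefix rule can be wrapped in a small shell helper so that a batch script does not hard-code the scheduler name. The sketch below is only an illustration: `SCHED_PREFIX` and `ear_export` are hypothetical names that are not part of EAR.

```bash
#!/bin/bash

# Hypothetical: scheduler prefix of the target system (e.g., SLURM or OAR).
SCHED_PREFIX=${SCHED_PREFIX:-SLURM}

# Hypothetical helper: export an EAR variable as <prefix>_<variable_name>.
ear_export() {
    local name=$1 value=$2
    export "${SCHED_PREFIX}_${name}=${value}"
}

ear_export EAR_LOADER_APPLICATION my_job_name

# Show the value of the prefixed variable that was just exported.
printenv "${SCHED_PREFIX}_EAR_LOADER_APPLICATION"
```

With this approach, moving a batch script to a system with another scheduler would only require changing the prefix in one place.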

# Loading EAR Library

## EAR_LOADER_APPLICATION

Tells the EAR Loader to load the EAR Library for a specific application that does not follow any of the programming models currently supported by EAR (for example, a sequential application).
The non-MPI version of the Library must be installed on your system (ask your system administrator).

The value of the environment variable must match the job name of the application you want to launch with EAR.
If you don’t provide a job name, the EAR Loader will compare the value against the executable name. For example:

```
#!/bin/bash

export SLURM_EAR_LOADER_APPLICATION=my_job_name

srun --ntasks 1 --job-name=my_job_name ./my_exec_file
```

See the [Use cases](User-guide#use-cases) section to read more information about how to run jobs with EAR.

## EAR_LOAD_MPI_VERSION

Forces the EAR Loader to load a specific MPI version of the EAR Library.
This is needed, for example, when you want to load the EAR Library for Python + MPI applications, where the Loader is not able to detect the MPI implementation the application is going to use.
Accepted values are either *intel* or *open mpi*.
The following example runs TensorFlow 1 benchmarks for several convolutional neural networks with EAR.
They can be downloaded from the TensorFlow benchmarks [repository](https://github.com/tensorflow/benchmarks).

```
#!/bin/bash

#SBATCH --job-name=TensorFlow
#SBATCH -N 8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=18

# Specific modules here
# ...

export SLURM_EAR_LOAD_MPI_VERSION="open mpi"

srun --ear-policy=min_time \
    python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    ... more application options
```

See the [Use cases](User-guide#use-cases) section to read more information about how to run jobs with EAR.

# Report plug-ins

## EAR_REPORT_ADD

Specifies a report plug-in to be loaded. The value must be a shared object file, and it must be located at `$EAR_INSTALL_PATH/lib/plugins/report` or at the path from where the job was launched.
Alternatively, you can provide the full path (absolute or relative) of the report plug-in.

```
#!/bin/bash

export SLURM_EAR_REPORT_ADD=my_report_plugin.so

srun -n 10 my_mpi_app
```

# Verbosity

## EARL_VERBOSE_PATH

Specifies a path where EAR creates a file (one per node involved in the job) to which messages from the EAR Library are printed.
This is useful when you run a job on multiple nodes, as the EAR verbose information for each of them can result in many messages mixed in stderr (the default channel for EAR messages).
Also, some applications print information to both stdout and stderr, so a user may want to keep EAR messages separate.

If the path does not exist, EAR will create it.
The format of the generated file names is `earl_log.<node_rank>.<local_rank>.<job_step>.<job_id>`, where *node_rank* is an integer set by EAR, from 0 to *n_nodes - 1*, that indicates which node involved in the job the information belongs to.
The local rank is an arbitrary rank set by EAR for a process in the node (from 0 to *n_processes_in_node - 1*).
It indicates which process is printing messages to the file, and it will always be the first one indexed, i.e., 0.
Finally, *job_step* and *job_id* identify the job and step of the execution that generated the messages.

```
#!/bin/bash

#SBATCH -J my_job_name
#SBATCH -N 2
#SBATCH -n 96

export SLURM_EARL_VERBOSE_PATH=ear_logs_dir_name
export I_MPI_HYDRA_BOOTSTRAP=slurm
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ear-verbose=1"

mpirun -np 96 -ppn 48 my_app
```

After the job in the above example completes, the directory from which the application was submitted will contain a directory called *ear_logs_dir_name* with two files, one for each node, called `earl_log.0.0.<job_step>.<job_id>` and `earl_log.1.0.<job_step>.<job_id>`, respectively.

# Frequency management

## EAR_GPU_DEF_FREQ

Sets a GPU frequency (in kHz) to be fixed while your job is running.
The same frequency is set for all GPUs used by the job.

```
#!/bin/bash

#SBATCH -J gromacs-cuda
#SBATCH -N 1

export I_MPI_PIN=1
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

input_path=/hpc/appl/biology/GROMACS/examples
input_file=ion_channel.tpr
GROMACS_INPUT=$input_path/$input_file

export SLURM_EAR_GPU_DEF_FREQ=1440000

srun --cpu-bind=core --ear-policy=min_energy gmx_mpi mdrun \
    -s $GROMACS_INPUT -noconfout -ntomp 1
```
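
Frequencies are usually quoted in MHz or GHz, while the frequency-related variables on this page expect kHz. The following sketch converts a value before exporting it; `to_khz` is a hypothetical helper written for this page, not an EAR command.

```bash
#!/bin/bash

# Hypothetical helper: convert a frequency given in MHz or GHz to kHz.
to_khz() {
    local value=$1 unit=$2 factor
    case $unit in
        MHz) factor=1000 ;;
        GHz) factor=1000000 ;;
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
    # awk handles fractional values such as 2.0; %.0f rounds to an integer.
    awk -v v="$value" -v f="$factor" 'BEGIN { printf "%.0f\n", v * f }'
}

to_khz 1440 MHz   # prints 1440000
to_khz 2.0 GHz    # prints 2000000

export SLURM_EAR_GPU_DEF_FREQ=$(to_khz 1440 MHz)
```

The helper only converts units; it does not check whether the resulting frequency is supported by the hardware.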

## EAR_JOB_EXCLUSIVE_MODE

Indicates whether the job will run on a node exclusively (non-zero value).
EAR will reduce the CPU frequency of those cores not used by the job.
This feature exploits a very simple opportunity for power saving.

```
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 64
#SBATCH --cpus-per-task=2
#SBATCH --exclusive

export SLURM_EAR_JOB_EXCLUSIVE_MODE=1

srun -n 10 --ear=on ./mpi_mpi_app
```

## Controlling Uncore/Infinity Fabric frequency

EARL can control the frequency of the Integrated Memory Controller (IMC) on Intel(R)
architectures and of the Infinity Fabric (IF) on AMD architectures.
On this page we will use the term *uncore* to refer to both of them.
Environment variables related to uncore control cover [policy-specific settings](#ear_set_imcfreq) and
the option for a user to [fix the uncore frequency](#ear_max_imcfreq-and-ear_min_imcfreq) for an entire job.

### EAR_SET_IMCFREQ

Enables/disables EAR's [eUFS](Home#publications) feature.
Type `ear-info` to see whether eUFS is enabled by default.

You can control the maximum time penalty permitted for eUFS by exporting `EAR_POLICY_IMC_TH`,
a float that sets the threshold preventing the policy from reducing the uncore frequency so much
that it could lead to a considerable performance penalty.

The example below enables eUFS with a penalty threshold of 3.5%:

```
#!/bin/bash
...

export SLURM_EAR_SET_IMCFREQ=1
export SLURM_EAR_POLICY_IMC_TH=0.035
...

srun [...] my_app
```

### EAR_MAX_IMCFREQ and EAR_MIN_IMCFREQ

Set the maximum and minimum values (in kHz) allowed for the *uncore* frequency.
Two variables are provided because Intel(R) architectures allow setting a range of
frequencies that limits their internal UFS mechanism.
If you set the two variables to different values, the minimum one will be set.

The example below shows a job execution that fixes the uncore frequency at 2.0 GHz:

```
#!/bin/bash
...

export SLURM_EAR_MAX_IMCFREQ=2000000
export SLURM_EAR_MIN_IMCFREQ=2000000
...

srun [...] my_app
```

## Load Balancing

By default, EAR policies try to set the best CPU (and uncore, if [enabled](#controlling-uncore-infinity-fabric-frequency)) frequency according to node-level metrics.
This behaviour can be changed by telling EAR to detect and deal with unbalanced workloads, i.e., workloads where processes differ in their MPI/computational activity.

When EAR detects such behaviour, policies slightly modify their CPU frequency selection
by setting a different frequency for each process's cores according to how far the process is from the critical path.
Please contact [ear-support@bsc.es](mailto:ear-support@bsc.es) if you want more details about how it works.

> Correct CPU binding is required to get the most benefit from this feature. Check the documentation of your application's programming model/vendor/flavour or of your system's batch scheduler.

### EAR_LOAD_BALANCE

Enables/disables EAR's Load Balance strategy in energy policies.
Type `ear-info` to see whether this feature is enabled by default.

The load imbalance detection algorithm is based on [POP-CoE](https://pop-coe.eu/node/69)'s Load Balance Efficiency metric, which is computed as the ratio between the average useful computation time (across all processes) and the maximum useful computation time (also across all processes).
By default (if `EAR_LOAD_BALANCE` is enabled), a node load balance efficiency below **0.8** will trigger EAR's Load Balancing algorithm.
This threshold value can be modified by setting the `EAR_LOAD_BALANCE_TH` environment variable.
For example, if you want to be more permissive with the application load balance and prevent
per-process CPU frequency selection, you can increase the load balance threshold:

```
#!/bin/bash
...

export SLURM_EAR_LOAD_BALANCE=1
export SLURM_EAR_LOAD_BALANCE_TH=0.89
...

srun [...] my_app
```
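
To make the metric concrete, the sketch below computes the Load Balance Efficiency as defined above (average useful computation time divided by the maximum) and compares it with a threshold. The function name, the input format (one time value per process, one per line) and the sample numbers are assumptions made for illustration; EAR computes this metric internally at runtime.

```bash
#!/bin/bash

# Hypothetical helper: read one useful computation time per process from stdin
# and print average / maximum, rounded to two decimals.
load_balance_efficiency() {
    awk '{ sum += $1; if (NR == 1 || $1 > max) max = $1 }
         END { if (NR > 0 && max > 0) printf "%.2f\n", (sum / NR) / max }'
}

threshold=0.8   # default threshold stated above
lbe=$(printf '%s\n' 8 8 8 10 | load_balance_efficiency)   # made-up sample times
echo "load balance efficiency: $lbe"

# An efficiency below the threshold is the condition that triggers load balancing.
if awk -v e="$lbe" -v t="$threshold" 'BEGIN { exit !(e < t) }'; then
    echo "below threshold"
else
    echo "not below threshold"
fi
```

With the sample times, the average is 8.5 and the maximum is 10, so the efficiency is 0.85, which is not below the default threshold of 0.8.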

## Support for Intel(R) Speed Select Technology

Since version 4.2, EAR supports interaction with [Intel(R) Speed Select Technology (Intel(R) SST)](https://www.intel.com/content/www/us/en/architecture-and-technology/speed-select-technology-article.html),
which gives the user finer-grained control over per-CPU Turbo frequency.
This feature gives users more control over the performance
(and power consumption) of the CPUs running their applications and jobs.
It is available on selected SKUs of Intel(R) Xeon(R) Scalable processors.
For more information about Intel(R) SST, useful links to official documentation are listed below:

* [Intel(R) SST-CP](https://networkbuilders.intel.com/solutionslibrary/intel-speed-select-technology-core-power-intel-sst-cp-overview-technology-guide)
* [Intel(R) SST-TF](https://networkbuilders.intel.com/solutionslibrary/intel-speed-select-technology-turbo-frequency-intel-sst-tf-overview-user-guide)
* [The Linux Kernel: Intel(R) Speed Select Technology User Guide](https://docs.kernel.org/admin-guide/pm/intel-speed-select.html)

EAR offers two environment variables that let you specify a list of priorities (CLOS) in two different ways.
The [first one](#ear_prio_tasks) will set a CLOS for each task involved in the job.
On the other hand, the [second offered variable](#ear_prio_cpus) will set a list of priorities per CPU involved in the job.
Values must be within the range of available CLOS that Intel(R) SST provides you.

If either of the two environment variables is set, EAR will set up all of its internals transparently, provided the architecture supports it.
It will also restore the configuration when the job ends.
If Intel(R) SST is not supported, no effect will occur.
If you enable [EARL verbosity](User-guide#ear-job-submission-flags) you will see
the mapping of the CLOS set for each CPU in the node.
Note that a `-1` value means that no change was made on that CPU.

### EAR_PRIO_TASKS

A list that specifies the CLOS to which the CPUs assigned to each task must be set.
This variable is useful because you can configure your application transparently
without worrying about the affinity mask that the scheduler assigns to your tasks.
You can use this variable when you know (or can guess) the workload of your application's tasks
and you want to tune it by manually setting different Turbo priorities.
Note that you still need to ensure that different tasks do not share CPUs.
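
For jobs with many tasks, the list can be generated instead of typed by hand. The sketch below builds a comma-separated list in which consecutive groups of tasks share one CLOS; `make_prio_list` is a hypothetical helper, and the choice of four CLOS with four tasks each is only an example.

```bash
#!/bin/bash

# Hypothetical helper: print a comma-separated list where each CLOS, from 0 to
# n_clos - 1, is repeated for `per_group` consecutive tasks.
make_prio_list() {
    local n_clos=$1 per_group=$2
    local list=() clos i
    for (( clos = 0; clos < n_clos; clos++ )); do
        for (( i = 0; i < per_group; i++ )); do
            list+=( "$clos" )
        done
    done
    local IFS=,            # join array elements with commas
    echo "${list[*]}"
}

export SLURM_EAR_PRIO_TASKS=$(make_prio_list 4 4)
echo "$SLURM_EAR_PRIO_TASKS"   # prints 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
```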

For example, imagine you want to submit a job that runs an MPI application with 16 tasks, each one
pinned to a single core, on a two-socket Intel(R) Xeon(R) Platinum 8352Y node with 32 cores
per socket and Hyper-Threading enabled, i.e., each task will run on two CPUs and 32 of
the 128 CPUs in total will be allocated to this application.
Below is a (simplified) batch script that submits this example:

```
#!/bin/bash

export SLURM_EAR_PRIO_TASKS=0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3

srun --ntasks=16 --cpu-bind=core,verbose --ear-policy=monitoring --ear-cpufreq=2201000 --ear-verbose=1 bin/bt.C.x
```

The above script sets CLOS 0 for tasks 0 to 3, CLOS 1 for tasks 4 to 7, CLOS 2 for
tasks 8 to 11 and CLOS 3 for tasks 12 to 15. The `srun` command binds each task
to one core (through the `--cpu-bind` flag), sets the turbo frequency and enables EAR
verbosity.
Below is the output message shown by the batch scheduler (i.e., SLURM):

```
cpu-bind=MASK - ice2745, task 0 0 [23363]: mask 0x10000000000000001 set
cpu-bind=MASK - ice2745, task 1 1 [23364]: mask 0x1000000000000000100000000 set
cpu-bind=MASK - ice2745, task 2 2 [23365]: mask 0x20000000000000002 set
cpu-bind=MASK - ice2745, task 3 3 [23366]: mask 0x2000000000000000200000000 set
cpu-bind=MASK - ice2745, task 4 4 [23367]: mask 0x40000000000000004 set
cpu-bind=MASK - ice2745, task 5 5 [23368]: mask 0x4000000000000000400000000 set
cpu-bind=MASK - ice2745, task 6 6 [23369]: mask 0x80000000000000008 set
cpu-bind=MASK - ice2745, task 7 7 [23370]: mask 0x8000000000000000800000000 set
cpu-bind=MASK - ice2745, task 8 8 [23371]: mask 0x100000000000000010 set
cpu-bind=MASK - ice2745, task 9 9 [23372]: mask 0x10000000000000001000000000 set
cpu-bind=MASK - ice2745, task 10 10 [23373]: mask 0x200000000000000020 set
cpu-bind=MASK - ice2745, task 11 11 [23374]: mask 0x20000000000000002000000000 set
cpu-bind=MASK - ice2745, task 12 12 [23375]: mask 0x400000000000000040 set
cpu-bind=MASK - ice2745, task 13 13 [23376]: mask 0x40000000000000004000000000 set
cpu-bind=MASK - ice2745, task 14 14 [23377]: mask 0x800000000000000080 set
cpu-bind=MASK - ice2745, task 15 15 [23378]: mask 0x80000000000000008000000000 set
```

We can see here that SLURM spread the tasks across the two sockets of the node,
e.g., task 0 runs on CPUs 0 and 64, task 1 runs on CPUs 32 and 96.
The output below shows how EAR sets and reports the CLOS list per CPU in the node.
Following the same example, you can see that CPUs 0, 64, 32 and 96 have priority/CLOS 0.
Note that those CPUs not involved in the job show a -1.

```
Setting user-provided CPU priorities...
PRIO0: MAX GHZ - 0.0 GHz (high)
PRIO1: MAX GHZ - 0.0 GHz (high)
PRIO2: MAX GHZ - 0.0 GHz (low)
PRIO3: MAX GHZ - 0.0 GHz (low)
[000, 0] [001, 0] [002, 1] [003, 1] [004, 2] [005, 2] [006, 3] [007, 3]
[008,-1] [009,-1] [010,-1] [011,-1] [012,-1] [013,-1] [014,-1] [015,-1]
[016,-1] [017,-1] [018,-1] [019,-1] [020,-1] [021,-1] [022,-1] [023,-1]
[024,-1] [025,-1] [026,-1] [027,-1] [028,-1] [029,-1] [030,-1] [031,-1]
[032, 0] [033, 0] [034, 1] [035, 1] [036, 2] [037, 2] [038, 3] [039, 3]
[040,-1] [041,-1] [042,-1] [043,-1] [044,-1] [045,-1] [046,-1] [047,-1]
[048,-1] [049,-1] [050,-1] [051,-1] [052,-1] [053,-1] [054,-1] [055,-1]
[056,-1] [057,-1] [058,-1] [059,-1] [060,-1] [061,-1] [062,-1] [063,-1]
[064, 0] [065, 0] [066, 1] [067, 1] [068, 2] [069, 2] [070, 3] [071, 3]
[072,-1] [073,-1] [074,-1] [075,-1] [076,-1] [077,-1] [078,-1] [079,-1]
[080,-1] [081,-1] [082,-1] [083,-1] [084,-1] [085,-1] [086,-1] [087,-1]
[088,-1] [089,-1] [090,-1] [091,-1] [092,-1] [093,-1] [094,-1] [095,-1]
[096, 0] [097, 0] [098, 1] [099, 1] [100, 2] [101, 2] [102, 3] [103, 3]
[104,-1] [105,-1] [106,-1] [107,-1] [108,-1] [109,-1] [110,-1] [111,-1]
[112,-1] [113,-1] [114,-1] [115,-1] [116,-1] [117,-1] [118,-1] [119,-1]
[120,-1] [121,-1] [122,-1] [123,-1] [124,-1] [125,-1] [126,-1] [127,-1]
```
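
The affinity masks printed by `srun` are hexadecimal bitmaps in which bit *n* corresponds to CPU *n*. Because the masks in this example are wider than 64 bits, they cannot be decoded with plain shell integer arithmetic. The hypothetical `mask_to_cpus` helper below, which is not part of EAR or SLURM, is a minimal sketch that decodes one hexadecimal digit at a time:

```bash
#!/bin/bash

# Hypothetical helper: print the CPU numbers whose bits are set in a
# hexadecimal affinity mask such as the ones reported by --cpu-bind=verbose.
mask_to_cpus() {
    local hex=${1#0x}          # drop the leading "0x" if present
    local len=${#hex}
    local cpus=() i digit bit
    for (( i = 0; i < len; i++ )); do
        # Walk the digits from least significant (rightmost) to most significant.
        digit=$(( 16#${hex:len-1-i:1} ))
        for (( bit = 0; bit < 4; bit++ )); do
            if (( (digit >> bit) & 1 )); then
                cpus+=( $(( i * 4 + bit )) )
            fi
        done
    done
    echo "${cpus[*]}"
}

# Mask of task 0, copied from the scheduler output above.
mask_to_cpus 0x10000000000000001
```

Decoding the mask of each task this way is a quick check that the CPUs EAR reports with a given CLOS are the ones the scheduler assigned to the corresponding tasks.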

### EAR_PRIO_CPUS

A list of priorities that should have the same length as the number of CPUs your
job is using.
This configuration lets you set up the CLOS of CPUs in a lower-level way:
**the *n-th* priority value of the list will set the priority of the *n-th* CPU your job is using.**

This way of configuring priorities requires the user to know exactly the affinity of the job's tasks
before launching the application, so it is harder to use if your goal is the one
you can achieve with the [above environment variable](#ear_prio_tasks): task-focused CLOS setting.
However, it is more flexible when the user has more control over the affinity set
for the application, because you can discriminate between different CPUs assigned to the same task.
Moreover, this is the only way to set different priorities for different threads in non-MPI applications.
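
As an illustration of discriminating between CPUs of the same task, the sketch below builds a per-CPU list that alternates two priorities. Several things here are assumptions rather than documented behaviour: that the value is comma-separated like `EAR_PRIO_TASKS`, that each task owns two consecutive entries of the list, and that CLOS 0 and 3 are available on the system. `repeat_pattern` is a hypothetical helper.

```bash
#!/bin/bash

# Hypothetical helper: repeat a comma-separated pattern `times` times
# (times is expected to be at least 1).
repeat_pattern() {
    local pattern=$1 times=$2
    local out=$pattern i
    for (( i = 1; i < times; i++ )); do
        out+=",$pattern"
    done
    echo "$out"
}

# Assumed layout: 4 tasks with 2 CPUs each, the first CPU of every task
# gets CLOS 0 and the second one gets CLOS 3.
export SLURM_EAR_PRIO_CPUS=$(repeat_pattern 0,3 4)
echo "$SLURM_EAR_PRIO_CPUS"   # prints 0,3,0,3,0,3,0,3
```

Before relying on such a list, check the CLOS mapping that EAR prints with verbosity enabled to confirm that each CPU received the intended priority.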

## Disabling EAR's affinity masks usage

For both [Load Balancing](#load-balancing) and [Intel(R) SST](#support-for-intel-r-speed-select-technology)
support, EAR uses the processes' affinity masks read at the beginning of the job.
If you are working on an application that changes (or may change) the affinity mask of its tasks, this can lead to a misconfiguration that EAR does not detect.
To avoid any unexpected problem, **we highly recommend you** to export the `EAR_NO_AFFINITY_MASK` environment variable (**even if you are not planning to use any of the mentioned features**).

# Data gathering

## EAR_GET_MPI_STATS

Use this variable to generate two files at the end of the job execution that will contain global, per-process MPI information.
You must specify the prefix (optionally with a path) of the filename. One file (*[path/]prefix.ear_mpi_stats.full_nodename.csv*) will contain a summary of MPI throughput (per process), while the other one (*[path/]prefix.ear_mpi_calls_stats.full_nodename.csv*) will contain finer-grained information about the different MPI call types.
Here is an example:

```
#!/bin/bash

#SBATCH -J mpi_job_name
#SBATCH -n 48

MPI_INFO_DST=$SLURM_JOBID-mpi_stats
mkdir $MPI_INFO_DST

export SLURM_EAR_GET_MPI_STATS=$MPI_INFO_DST/$SLURM_JOB_NAME

srun -n 48 --ear=on ./mpi_app
```

At the end of the job, two files will be created in the directory named *\<job_id\>-mpi_stats*, located in the same directory from which the application was submitted.
They will be named *mpi_job_name.ear_mpi_stats.full_nodename.csv* and *mpi_job_name.ear_mpi_calls_stats.full_nodename.csv*.
A pair of files will be created for each node involved in the job.

Take into account that each process appends its own MPI statistics to the files. Since only one process writes the header, this behaviour does not guarantee that the header will be on the first line of each file. You must move it to the top of each file manually before reading the files with any tool you use to visualize and work with CSV files, e.g., a spreadsheet or an R or Python package.
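
Moving the header can be scripted. The sketch below assumes that the header is the only line starting with the name of the first field (`mrank` for the *ear_mpi_stats* file and `Master` for the *ear_mpi_calls_stats* file, as listed in the tables on this page). `fix_csv_header` is a hypothetical helper, and the `;` separator and the sample rows in the demo are made up.

```bash
#!/bin/bash

# Hypothetical helper: rewrite a CSV file so that the header line, identified
# by the name of its first field, becomes the first line.
fix_csv_header() {
    local file=$1 first_field=$2
    local tmp
    tmp=$(mktemp) || return 1
    grep -m 1 "^${first_field}" "$file" > "$tmp"    # header first
    grep -v "^${first_field}" "$file" >> "$tmp"     # then every data row
    mv "$tmp" "$file"
}

# Demo on a temporary file with the header in the middle (made-up content).
demo=$(mktemp)
printf '%s\n' '0;1;120' 'mrank;lrank;total_mpi_calls' '0;0;118' > "$demo"
fix_csv_header "$demo" mrank
cat "$demo"
```

The relative order of the data rows is preserved; only the header line is moved.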

The table below shows the fields available in the **ear_mpi_stats** file:

| Field | Description |
| ----- | ----------- |
| mrank | The EAR's internal node ID used to identify the node. |
| lrank | The EAR's internal rank ID used to identify the process. |
| total\_mpi\_calls | The total number of MPI calls. |
| exec\_time | The execution time, in microseconds. |
| mpi\_time | The time spent in MPI calls, in microseconds. |
| perc\_mpi\_time | The percentage of total execution time (i.e., *exec\_time*) spent in MPI calls. |

The table below shows the fields available in the **ear_mpi_calls_stats** file:

| Field | Description |
| ----- | ----------- |
| Master | The EAR's internal node ID used to identify the node. |
| Rank | The EAR's internal rank ID used to identify the process. |
| Total MPI calls | The total number of MPI calls. |
| MPI\_time/Exec\_time | The ratio between time spent in MPI calls and the total execution time. |
| Exec\_time | The execution time, in microseconds. |
| Sync\_time | Time spent (in microseconds) in **blocking** synchronization calls, i.e., MPI\_Wait, MPI\_Waitall, MPI\_Waitany, MPI\_Waitsome and MPI\_Barrier. |
| Block\_time | Time spent in blocking calls, i.e., MPI\_Allgather, MPI\_Allgatherv, MPI\_Allreduce, MPI\_Alltoall, MPI\_Alltoallv, MPI\_Barrier, MPI\_Bcast, MPI\_Bsend, MPI\_Cart\_create, MPI\_Gather, MPI\_Gatherv, MPI\_Recv, MPI\_Reduce, MPI\_Reduce\_scatter, MPI\_Rsend, MPI\_Scan, MPI\_Scatter, MPI\_Scatterv, MPI\_Send, MPI\_Sendrecv, MPI\_Sendrecv\_replace, MPI\_Ssend and all *Wait* calls of **Sync\_time** field. |
| Collec\_time | Time spent in **blocking** collective calls, i.e., MPI\_Allreduce, MPI\_Reduce and MPI\_Reduce\_scatter. |
| Total MPI sync calls | Total number of synchronization calls. |
| Total blocking calls | Total number of blocking calls. |
| Total collective calls | Total number of collective calls. |
| Gather | Total number of **blocking** Gather calls, i.e., MPI\_Allgather, MPI\_Allgatherv, MPI\_Gather and MPI\_Gatherv. |
| Reduce | Total number of **blocking** Reduce calls, i.e., MPI\_Allreduce, MPI\_Reduce and MPI\_Reduce\_scatter. |
| All2all | Total number of **blocking** All2all calls, i.e., MPI\_Alltoall and MPI\_Alltoallv. |
| Barrier | Total number of **blocking** Barrier calls, i.e., MPI\_Barrier. |
| Bcast | Total number of **blocking** Bcast calls, i.e., MPI\_Bcast. |
| Send | Total number of **blocking** Send calls, i.e., MPI\_Bsend, MPI\_Rsend, MPI\_Send and MPI\_Ssend. |
| Comm | Total number of **blocking** Comm calls, i.e., MPI\_Cart\_create. |
| Receive | Total number of **blocking** Receive calls, i.e., MPI\_Recv. |
| Scan | Total number of **blocking** Scan calls, i.e., MPI\_Scan. |
| Scatter | Total number of **blocking** Scatter calls, i.e., MPI\_Scatter and MPI\_Scatterv. |
| SendRecv | Total number of **blocking** SendRecv calls, i.e., MPI\_Sendrecv, MPI\_Sendrecv\_replace. |
| Wait | Total number of **blocking** Wait calls, i.e., all MPI\_Wait calls. |
| t_Gather | Time (in microseconds) spent in **blocking** Gather calls. |
| t_Reduce | Time (in microseconds) spent in **blocking** Reduce calls. |
| t_All2all | Time (in microseconds) spent in **blocking** All2all calls. |
| t_Barrier | Time (in microseconds) spent in **blocking** Barrier calls. |
| t_Bcast | Time (in microseconds) spent in **blocking** Bcast calls. |
| t_Send | Time (in microseconds) spent in **blocking** Send calls. |
| t_Comm | Time (in microseconds) spent in **blocking** Comm calls. |
| t_Receive | Time (in microseconds) spent in **blocking** Receive calls. |
| t_Scan | Time (in microseconds) spent in **blocking** Scan calls. |
| t_Scatter | Time (in microseconds) spent in **blocking** Scatter calls. |
| t_SendRecv | Time (in microseconds) spent in **blocking** SendRecv calls. |
| t_Wait | Time (in microseconds) spent in **blocking** Wait calls. |

## EAR_TRACE_PLUGIN

EAR offers the chance to generate Paraver traces to visualize runtime metrics with the [Paraver tool](https://tools.bsc.es/paraver).
Paraver is a visualization tool developed by the CEPBA-Tools team and currently maintained by the Barcelona Supercomputing Center’s tools team.

The EAR trace generation mechanism was designed to support different trace generation plug-ins, although the Paraver trace plug-in is the only one supported for now.
You must set the value of this variable to `tracer_paraver.so` to load the tracer.
This shared object comes with the official EAR distribution and it is located at `$EAR_INSTALL_PATH/lib/plugins/tracer`.
Then you need to set the `EAR_TRACE_PATH` variable (see below) to specify the destination path of the generated Paraver traces.

## EAR_TRACE_PATH

Specifies the path where you want to store the trace files generated by the EAR Library. The path must already exist; otherwise, the Paraver tracer plug-in won’t be loaded.

Here is an example of the usage of the two environment variables explained above:

```
#!/bin/bash
...

export SLURM_EAR_TRACE_PLUGIN=tracer_paraver.so
export SLURM_EAR_TRACE_PATH=$(pwd)/traces
mkdir -p $SLURM_EAR_TRACE_PATH

srun -n 10 --ear=on ./mpi_app
```

## REPORT_EARL_EVENTS

Use this variable (i.e., `export SLURM_REPORT_EARL_EVENTS=1`) to make EARL send internal events to the [Database](EAR-Database).
These events are useful for getting more information about the Library's behaviour, such as
when DynAIS **(REFERENCE DYNAIS)** is turned off, the computational phase EAR estimates the application is in,
or the status of the applied policy **(REF POLICIES)**.
You can query job-specific events through `eacct -j <JobID> -x`, and you will get
a table of all reported events:

| Field name | Description |
| ---------- | ----------- |
| Event_ID | Internal ID of the event stored in the Database. |
| Timestamp | *yyyy-mm-dd hh:mm:ss*. |
| Event_type | The kind of event. Possible event types are explained below. |
| Job_id | The JobID of the event. |
| Value | The value stored with the event. Categorical events are explained below. |
| node_id | The node from where the event was reported. |

### Event types

All the event types you can get when requesting job events are listed below.
For categorical event values, the (value, category) mapping is explained.

- **policy_error** Reported when the policy couldn't select the optimal frequency.
- **dynais_off** Reported when DynAIS is turned off and the Library switches to *periodic monitoring mode*.
- **earl_state** The internal EARL state. Possible values are:
  - **0** This is the initial state and stands for no iteration detected.
  - **1** EAR starts computing the signature.
  - **2** EAR computes the local signature and executes the per-node policy.
  - **3** This state computes a new signature and evaluates the accuracy of the policy.
  - **4** Projection error.
  - **5** This is a transition state to recompute EARL timings just in case we need to adapt them because of the frequency selection.
  - **6** Signature has changed.
- **optim_accuracy** The internal optimization policy state. Possible values are:
  - **0** Policy not ready.
  - **1** Policy says all is ok.
  - **2** Policy says it's not ok.
  - **3** Policy wants to try again to optimize.

> The above event types may be useful only for advanced users. Please contact
[ear-support@bsc.es](mailto:ear-support@bsc.es) if you want to know more about EARL internals.

- **energy_saving** Energy (in %) EAR estimates the policy is saving.
- **power_saving** Power (in %) EAR estimates the policy is saving.
- **performance_penalty** Execution time increase (in %) EAR estimates the policy is causing.