# Job monitoring and optimization with EAR

EAR was first designed to be 100% transparent to users, meaning that you can run your applications enabling, disabling or tuning EAR with minimal effort to change your workflow (e.g., submission scripts). This is achieved by providing integrations (e.g., plug-ins, hooks) with system batch schedulers, which do all the work to set up EAR at job submission.

Currently, **[SLURM](https://slurm.schedmd.com/documentation.html) is the batch scheduler fully compatible with EAR**, thanks to EAR's SLURM SPANK plug-in.

With EAR's SLURM plug-in, running an application with EAR is as easy as submitting a job with either `srun`, `sbatch` or `mpirun`. The EAR Library (EARL) is automatically loaded.

EAR supports the utilization of both `mpirun`/`mpiexec` and `srun` commands.

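As an illustrative sketch (job name, resources and application name are made up), a minimal job script needs nothing EAR-specific:

```shell
#!/bin/bash
#SBATCH --job-name=ear_demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# No EAR-specific setup is needed here: the SLURM SPANK plug-in
# loads the EAR Library (EARL) transparently for the job step.
srun ./my_mpi_app
```

Submit it as usual, e.g. with `sbatch job.sh`.
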
When using `sbatch`/`srun` or `salloc`, [Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.mufipm) and [OpenMPI](https://www.open-mpi.org/) are fully supported.

When using specific MPI flavour commands to start applications (e.g., `mpirun`, `mpiexec.hydra`), there are some key points which you must take into account. See the [next sections](#using-mpirunmpiexec-command) for examples and more details.

Review SLURM's [MPI Users Guide](https://slurm.schedmd.com/mpi_guide.html), read your cluster documentation or ask your system administrator to see how SLURM is integrated with the MPI library in your system.

### Hybrid MPI + (OpenMP, CUDA, MKL) applications

EARL automatically supports this use case.

### Python and Julia MPI applications

EARL cannot automatically detect MPI symbols when one of these languages is used. In that case, an environment variable is provided to give EARL a hint of the MPI flavour being used.

Export the [`EAR_LOAD_MPI_VERSION`](EAR-environment-variables#ear_load_mpi_version) environment variable with the value from the following table, depending on the MPI implementation you are loading:

| MPI flavour | Value |
| ----------- | ----- |

Check the [mpi4py section](#mpi4py) for running this kind of use case.

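As a sketch for an mpi4py program (the exported value is a placeholder; take the actual string from the table above for your MPI implementation, and adjust the script name):

```shell
# Give EARL a hint about the MPI flavour before launching a Python MPI app.
export EAR_LOAD_MPI_VERSION="<value from the table>"
srun python my_mpi4py_app.py
```
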
### Python

Since version 4.1, EAR automatically loads the Library with Python applications, so no further action is needed. You must run the application with the `srun` command so it goes through EAR's SLURM SPANK plug-in, which lets you enable, disable or tune EAR. See the [EAR submission flags](#ear-job-submission-flags) provided by the EAR SLURM integration.

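For instance (a sketch; `--ear` is one of the submission flags provided by the EAR SLURM SPANK plug-in, and the script name is made up):

```shell
# Explicitly enable EAR for a Python job step through the SPANK plug-in.
srun --ear=on python my_script.py
```
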
### OpenMP, CUDA, Intel MKL and OneAPI

To load EARL automatically with non-MPI applications, the application must be compiled with dynamic symbols and executed with the `srun` command. For example, for CUDA applications you must use the `--cudart=shared` option at compile time. EARL is loaded for the OpenMP, MKL and CUDA programming models when their symbols are dynamically detected.

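An illustrative sketch for a CUDA application (source and binary names are made up):

```shell
# Link the CUDA runtime as a shared library so EARL can detect its symbols.
nvcc --cudart=shared -o my_app my_app.cu

# Run through srun so the SPANK plug-in can load EARL.
srun ./my_app
```
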
## Other application types or frameworks

For other programming models or sequential apps not supported by default, EARL can be forced to be loaded by setting the [`EAR_LOADER_APPLICATION`](EAR-environment-variables#ear_loader_application) environment variable, which must be defined with the executable name. For example:

```
export EAR_LOADER_APPLICATION=my_app
srun my_app
```

[Apptainer](https://apptainer.org/) (formerly Singularity) is an open source technology for containerization. It is widely used in HPC contexts because the level of virtualization it offers enables access to local services. It allows for greater reproducibility, making programs less dependent on the environment they run on.

An example Singularity command could look something like this:

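A minimal sketch (image and application names are made up):

```shell
# Run an application from a container image on the host.
singularity exec my_image.sif my_app
```
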
| ... | @@ -171,12 +171,12 @@ To bind folders there are two options: (1) using the environment variable `SINGU |
... | @@ -171,12 +171,12 @@ To bind folders there are two options: (1) using the environment variable `SINGU |
|
|
Specifying *path_2* and *perm* is optional. If they are not specified, *path_1* will be bound to the same location.

To make EAR work, the following paths should be added to the binding configuration:

- `$EAR_INSTALL_PATH,$EAR_INSTALL_PATH/bin,$EAR_INSTALL_PATH/lib,$EAR_TMP`

An EAR module should be available on your system providing the above environment variables. Contact your system administrator for more information.

Once the paths are bound, executing (for example) an OpenMPI application inside a Singularity/Apptainer container with the EAR Library enabled requires just the following:

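A sketch of such a launch (image and application names are made up; `SINGULARITY_BIND` is Singularity's standard bind variable, carrying the EAR paths listed above):

```shell
# Bind the EAR paths into the container and launch through srun so the
# SPANK plug-in loads the EAR Library for the containerized MPI app.
export SINGULARITY_BIND="$EAR_INSTALL_PATH,$EAR_INSTALL_PATH/bin,$EAR_INSTALL_PATH/lib,$EAR_TMP"
srun singularity exec my_image.sif my_openmpi_app
```
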
See the [next section](#runtime-report-plug-ins) about report plug-ins.

COMP Superscalar ([COMPSs](https://compss-doc.readthedocs.io/en/latest/index.html)) is a task-based programming model which aims to ease the development of applications for distributed infrastructures, such as large High-Performance Computing (HPC) clusters, clouds and container-managed clusters. COMPSs provides a programming interface for the development of applications and a runtime system that exploits the inherent parallelism of applications at execution time.

**Since version 5.0, EAR supports monitoring and optimization of workflows**, and the COMPSs Framework includes integration with EAR. Check out the [dedicated section](https://compss-doc.readthedocs.io/en/latest/Sections/05_Tools/05_EAR.html#) of the official COMPSs documentation for more information about how to measure the energy consumption of your workflows.

EARL loading is **only available** using `enqueue_compss` and with Python applications.

As a very simple hint of your application workload, you can enable EARL verbosity.

**The information is shown at _stderr_ by default.** Read how to set up verbosity at [submission time](#ear-job-submission-flags), and see the [verbosity environment variables](EAR-environment-variables#verbosity) provided for more advanced tuning of this EAR feature.

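For example (a sketch; `--ear-verbose` is the verbosity flag provided by the EAR SLURM plug-in, and the application name is made up):

```shell
# Print basic EARL runtime information for the job step to stderr.
srun --ear-verbose=1 ./my_app
```
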
## Post-mortem application data

It has an internal system to avoid repeating functions that are executed just one time per job or node, like SLURM does with its plug-ins.

**IMPORTANT NOTE**: If you are going to launch `n` applications with the `erun` command through an sbatch job, you must set the environment variable `SLURM_STEP_ID` to values from `0` to `n-1` before each `mpirun` call. This way, `erun` will inform the EARD of the correct step ID to be stored in the Database.

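That pattern could be sketched inside an sbatch script as follows (application names, task counts and `erun`'s `--program` argument are illustrative):

```shell
# Launch n=3 applications; set SLURM_STEP_ID to 0..n-1 before each mpirun
# call so erun reports the correct step ID to the EARD.
for step in 0 1 2; do
  export SLURM_STEP_ID=$step
  mpirun -np 4 erun --program="./my_app_$step"
done
```
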
# EAR job Accounting (`eacct`)

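As a minimal usage sketch (the job ID is made up; check `eacct` help on your system for the full option list):

```shell
# Show the accounting data EAR stored for job 123456.
eacct -j 123456
```
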
The Library deals with job monitoring and is the component which implements and applies optimization policies based on the monitored workload.

**We highly recommend** reading the [EARL](https://gitlab.bsc.es/ear_team/ear/-/wikis/Architecture#the-ear-library-job-manager) documentation, as well as how energy policies work, to better understand what the Library is doing internally. With that background you can easily explore all the features EAR offers to the end user (e.g., tuning variables, collecting data), learn how many resources your application consumes, and correlate that usage with the application's computational characteristics.