|
Running applications with EAR
|
|
[[_TOC_]]
|
|
------------------------------
|
|
|
|
|
|
|
|
With EAR's SLURM plugin, running an application with EAR is as easy as submitting a job with either `srun`, `sbatch` or `mpirun`. There are multiple configuration settings that can be used to customize EAR's behaviour, which are explained below, together with examples of how to run applications with each method.
|
|
# Running jobs with EAR
|
|
|
|
|
|
|
|
With EAR's SLURM plugin, running an application with EAR is as easy as submitting a job with either `srun`, `sbatch` or `mpirun`. The EAR Library is automatically loaded with some applications when EAR is enabled by default.
|
|
|
|
|
|
Note: mpi4py is not supported. Use `--ear=off` in case the EAR Library is on by default.
|
|
You can type `ear-info` to see whether EAR is turned on by default.
|
|
|
|
For other schedulers, a simple prolog/epilog command can be created to provide transparent job submission with EAR and default configuration.
|
|
|
|
|
|
|
|
# Use cases
|
|
|
|
|
|
|
|
## MPI applications
|
|
|
|
|
|
|
|
The EAR Library is automatically loaded with MPI applications when EAR is enabled by default (check `ear-info`). EAR supports the utilization of both `mpirun`/`mpiexec` and `srun` commands.
|
|
|
|
|
|
|
|
When using `sbatch`/`srun` or `salloc`, Intel MPI and OpenMPI are fully supported. When using specific MPI flavour commands to start applications (e.g., `mpirun`, `mpiexec.hydra`), there are some key points you must take into account. See the next sections for examples and more details.
|
|
|
|
|
|
|
|
## Hybrid MPI + (OpenMP, CUDA, MKL) applications
|
|
|
|
|
|
|
|
The EAR Library automatically supports this use case. Check with the `ear-info` command whether the EAR Library is `on`/`off` by default. If it is `off`, use the `--ear=on` option offered by the EAR SLURM plugin to enable it. `mpirun`/`mpiexec` and `srun` are supported in the same manner as explained above.
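As an illustration, a hybrid MPI + OpenMP job could be submitted as follows (the application name and resource numbers are hypothetical, and `--ear=on` is only needed if the library is off by default):

```
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12

export OMP_NUM_THREADS=12
srun --ear=on ./hybrid_app
```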
|
|
|
|
|
|
|
|
## Python (not MPI)
|
|
|
|
|
|
|
|
EAR version 4.1 automatically executes the EAR Library with Python applications, so no action is needed. Check with the `ear-info` command whether the EAR Library is `on`/`off` by default. If it is `off`, use the `--ear=on` option offered by the EAR SLURM plugin to enable it.
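For example (the script name is hypothetical), a non-MPI Python run under EAR could simply be:

```
srun --ear=on -N 1 -n 1 python my_script.py
```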
|
|
|
|
|
|
|
|
## Python + MPI applications
|
|
|
|
|
|
|
|
The EAR Library cannot automatically detect MPI symbols when Python is used. In that case, an environment variable is provided to specify which MPI flavour is used. Export the `SLURM_EAR_LOAD_MPI_VERSION` environment variable with either _intel_ or _open mpi_, e.g., `export SLURM_EAR_LOAD_MPI_VERSION="open mpi"`; these are the two MPI implementations fully supported by EAR.
|
|
|
|
|
|
|
|
Check with the `ear-info` command whether the EAR Library is `on`/`off` by default. If it is `off`, use the `--ear=on` option offered by the EAR SLURM plugin to enable it.
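A minimal job script could look like the following (the application name and task count are hypothetical):

```
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 24

export SLURM_EAR_LOAD_MPI_VERSION="open mpi"
srun python my_mpi_app.py
```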
|
|
|
|
|
|
|
|
## OpenMP, CUDA, MKL (non-MPI) applications
|
|
|
|
|
|
|
|
To load the EAR Library automatically with non-MPI applications, the application must be compiled with dynamic symbols and executed with the `srun` command. For example, for CUDA applications the `--cudart=shared` option must be used. EARL is loaded for the OpenMP, MKL and CUDA programming models when their symbols are dynamically detected.
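For instance, a CUDA application (names are hypothetical) could be built and launched like this, assuming `nvcc` is used as the compiler:

```
# Link the CUDA runtime dynamically so its symbols can be detected
nvcc --cudart=shared -o my_cuda_app my_cuda_app.cu

# Launch through srun so the EAR Library can be loaded
srun -N 1 -n 1 ./my_cuda_app
```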
|
|
|
|
|
|
|
|
## Other application types or frameworks
|
|
|
|
|
|
|
|
For other programming models or sequential apps not supported by default, EARL can be forced to be loaded by setting the `SLURM_EAR_LOADER_APPLICATION` environment variable to the application name.
|
|
|
|
|
|
|
|
```
#!/bin/bash

export SLURM_EAR_LOADER_APPLICATION=my_app
srun my_app
```
|
|
|
|
|
|
|
|
# MPI + srun
|
|
|
|
|
|
|
|
Running MPI applications with EARL is automatic on SLURM systems when using `srun`. All jobs are monitored by EAR and the Library is loaded by default, depending on the cluster configuration. To run a job with `srun` and EARL there is no need to load the EAR module. Even though it is automatic, there are a few flags that can be selected at job submission. When using SLURM commands for job submission, both Intel MPI and OpenMPI implementations are supported.
|
|
|
|
|
|
|
|
## Job submission with EAR and SLURM

The following EAR options can be specified when running `srun` and/or `sbatch`, and are supported with `srun`/`sbatch`/`salloc`:

| Option | Description |
| -------------------------- | -------------------------------------------------------------------- |
| `--ear=on/off` (\*\*) | Enables/disables EAR Library loading with this job. |
| `--ear-policy=policy` | Selects an energy policy for EAR. See the [Policies page](EAR-policies) for more info. |
| `--ear-cpufreq=frequency` (\*) | Specifies the starting CPU frequency (in kHz) to be used by the chosen EAR policy. |
| `--ear-policy-th=value` (\*) | Specifies the ear_threshold value to be used by the chosen EAR policy (`value=[0...1]`). |
| `--ear-user-db=filename` | Asks the EAR Library to generate a set of CSV files with EARL metrics. One file per node is generated with the average node metrics (node signature) and one file with multiple lines per node is generated with the runtime collected metrics (loop signatures). |
| `--ear-verbose=value` | Specifies the level of verbosity (`value` = 0 or 1); the default is 0. Verbose messages are placed by default in _stderr_. For jobs with multiple nodes this option can result in lots of messages mixed in _stderr_; you can set the `SLURM_EARL_VERBOSE_PATH` environment variable to a directory path and one file per node will be generated with the EAR output. The directory is created automatically if needed. |
| `--ear-tag=tag` | Selects an energy tag. |
| `--ear-learning=p_state` (\*) | Enables the learning phase for a given P_STATE (`p_state=[1...n]`). |

(\*) Option requires _ear privileges_ to be used.

(\*\*) Does not require _ear privileges_, but values might be limited by the EAR configuration.

## CPU frequency selection
|
|
|
|
|
|
The EAR configuration file supports the specification of *EAR authorized users*, who can ask for more privileged submission options. The most relevant ones are the possibility to ask for a specific optimisation policy and a specific CPU frequency. Contact your sysadmin or helpdesk team to become an authorized user.
|
|
|
|
|
|
|
|
- The `--ear-policy=policy_name` flag asks for the _policy_name_ policy. Type `srun --help` to see the policies currently installed in your system.
- The `--ear-cpufreq=value` flag (_value_ must be given in kHz) asks for a specific CPU frequency.
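For instance, an authorized user could submit the following (the frequency value is just illustrative):

```
srun --ear-policy=monitoring --ear-cpufreq=2400000 -N 1 -n 24 application
```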
|
|
|
|
|
|
|
|
## GPU frequency selection
|
|
|
|
|
|
|
|
EAR version 3.4 and upwards supports GPU monitoring for NVIDIA devices from the point of view of application and node monitoring. GPU frequency optimization is not yet supported. **Authorized** users can ask for a specific GPU frequency by setting the `SLURM_EAR_GPU_DEF_FREQ` environment variable, giving the desired GPU frequency expressed in kHz. Only one frequency for all GPUs is currently supported. Contact your sysadmin or helpdesk team to become an authorized user.
|
|
|
|
|
|
|
|
To see the list of available frequencies of the GPU you will work on, you can type the following command:
|
|
|
|
```
nvidia-smi -q -d SUPPORTED_CLOCKS
```
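For example, an authorized user could request a fixed GPU frequency like this (the frequency value, GPU count and application name are illustrative):

```
export SLURM_EAR_GPU_DEF_FREQ=1400000
srun --gres=gpu:4 -N 1 -n 4 ./gpu_application
```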
|
|
|
|
|
|
For more information, consult the `srun --help` output or see the configuration options sections for a more detailed description.
|
|
# MPI + mpirun
|
|
|
|
|
|
To provide an automatic loading of the EAR Library, the only requirement from the MPI library is to be coordinated with the scheduler.
|
|
|
|
|
|
## Intel MPI
|
|
|
|
|
|
|
|
Recent versions of Intel MPI offer two environment variables that can be used to guarantee the correct scheduler integration:

- `I_MPI_HYDRA_BOOTSTRAP` sets the bootstrap server. It must be set to slurm.
- `I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS` sets additional arguments for the bootstrap server. These arguments are passed to slurm.

You can read [here](https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/hydra-environment-variables.html) the Intel environment variables guide.
|
|
|
|
|
|
|
|
## OpenMPI
|
|
|
|
|
|
|
|
For OpenMPI and EAR it is highly recommended to use SLURM. When using `mpirun`, as OpenMPI is not fully coordinated with the scheduler, the EAR Library is not automatically loaded on all the nodes. If `mpirun` is used, EARL will be disabled and only basic energy metrics will be reported.
|
|
|
|
|
|
|
|
## MPI4PY
|
|
|
|
|
|
|
|
To use MPI with Python applications, the EAR Loader cannot automatically detect symbols to classify the application as Intel MPI or OpenMPI. In order to specify it, the user has to define the `SLURM_LOAD_MPI_VERSION` environment variable with the value _intel_ or _open mpi_. It is recommended to add this variable to the Python environment modules to make it easy for final users.
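For example (the script name is hypothetical), an Intel MPI mpi4py run could be submitted as:

```
export SLURM_LOAD_MPI_VERSION="intel"
srun -N 1 -n 24 python my_mpi_app.py
```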
|
|
|
|
|
|
|
|
## Using additional MPI profiling libraries/tools
|
|
|
|
|
|
|
|
EAR uses the `LD_PRELOAD` mechanism to be loaded and the PMPI API for transparent loading. In order to be compatible with other profiling libraries, EAR does not replace the MPI symbols; it just calls the next symbol in the list, so it is compatible with other tools and profiling libraries. In case of conflict, EARL can be disabled by setting the `--ear=off` flag at submission time.
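For example, if a conflict with another PMPI-based tool is suspected, EARL can be disabled for a single run (the application name is hypothetical):

```
srun --ear=off -N 1 -n 24 ./application
```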
|
|
|
|
|
|
|
|
# Examples
|
|
|
|
|
|
## `srun` examples
|
|
|
|
|
|
The EAR plugin reads the `srun` options and contacts the EARD. Invalid options are filtered to default values, so the behaviour will depend on the system configuration.
|
|
Having an MPI application asking for one node and 24 tasks, the following is a simple case of job submission. If the EAR Library is turned on by default, no extra options are needed to load it. To check whether it is on by default, load the EAR module and execute the `ear-info` command. EAR verbosity is set to 0 by default (no messages).
|
|
|
|
|
|
The following executes the application with EAR on/off (depending on the default configuration) and default values:

```
srun -J test -N 1 -n 24 --tasks-per-node=24 application
```
|
|
The following executes the application with EAR on, default values (policy, default frequency, etc.) and verbosity set to 1, showing EAR messages, including the EAR configuration and the node signature, in _stderr_:

```
srun --ear-verbose=1 -J test -N 1 -n 24 --tasks-per-node=24 application
```
|
|
EARL verbose messages are generated in the standard error. For jobs using more than 2 or 3 nodes, messages can be overwritten. If you want the EARL messages to be stored in files, set the `SLURM_EARL_VERBOSE_PATH` environment variable to a folder name; one file per node will be generated with the EARL messages.

```
export SLURM_EARL_VERBOSE_PATH=logs
srun --ear-verbose=1 -J test -N 1 -n 24 --tasks-per-node=24 application
```
|
|
|
|
|
|
|
|
The following asks for the EAR Library metrics to be stored in CSV files after the application execution. Two files per node will be generated: one with the average/global signature and another with the loop signatures. The format of the output files is _\<filename\>.\<nodename\>_.time.csv for the global signature and _\<filename\>.\<nodename\>_.time.loops.csv for the loop signatures.

```
srun -J test -N 1 -n 24 --tasks-per-node=24 --ear-user-db=filename application
```

For users authorized to select the _memory-intensive_ energy tag, the following executes the application according to the definition of that tag in the EAR configuration:

```
srun --ear-tag=memory-intensive --ear-verbose=1 -J test -N 1 -n 24 --tasks-per-node=24 application
```
|
|
|
|
|
|
For EAR *authorized users*, the following executes the application with a CPU frequency of 2.0GHz and the monitoring policy:

```
srun --ear-cpufreq=2000000 --ear-policy=monitoring --ear-verbose=1 -J test -N 1 -n 24 --tasks-per-node=24 application
```

Note that for `--ear-cpufreq` to have any effect, you must specify the `--ear-policy` option, even if you want to run your application with the default policy.
|
|
|
|
|
|
|
|
|
|
## `sbatch` + EARL + srun
|
|
|
|
|
|
|
|
When using `sbatch`, EAR options can be specified in the same way. If more than one `srun` is included in the job submission, EAR options can be inherited from `sbatch` to the different `srun` instances, or they can be specifically modified on each individual `srun`.
|
|
|
|
|
|
|
|
The following example will execute the application twice. Both instances will have the verbosity set to 1. As the job is asking for 10 nodes, we have set the `SLURM_EARL_VERBOSE_PATH` environment variable to the _ear_logs_ folder. Moreover, the second step will create a set of CSV files placed in the _ear_metrics_ folder. The node name, Job ID and Step ID are part of the filename for better identification.
|
|
|
|
|
|
|
|
|
|
```
#!/bin/bash
#SBATCH -N 1
#SBATCH --cpus-per-task=1
#SBATCH --ear-verbose=1

export SLURM_EARL_VERBOSE_PATH=ear_logs

srun application

mkdir ear_metrics
srun --ear-user-db=ear_metrics/app_metrics application
```
|
|
|
|
|
|
## EARL + mpirun

### Intel MPI

When running EAR with `mpirun` rather than `srun`, we have to specify the utilization of `srun` as the bootstrap; otherwise jobs will not go through the SLURM plugin and the EAR options will not be recognised. Intel MPI 2019 and newer offers two environment variables for the bootstrap server specification and its arguments (older versions used the `mpirun` arguments `-bootstrap slurm` and `-bootstrap-exec-args` instead):
|
|
```
export I_MPI_HYDRA_BOOTSTRAP=slurm
export I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS="--ear-policy=monitoring --ear-verbose=1"
mpiexec.hydra -n 10 application
```
|
|
|
|
|
|
### OpenMPI
|
|
|
|
|
|
|
|
Bootstrap is an Intel® MPI option, but not an OpenMPI option. For OpenMPI, `srun` must be used for automatic EAR support. If OpenMPI with `mpirun` is needed, EAR offers the `erun` program explained below.
|
|
## erun

`erun` is a program that simulates the whole SLURM and EAR SLURM Plugin pipeline. It comes with the EAR package and is compiled automatically; you can find it in the `bin` folder of your installation path. It must be used when a set of nodes does not have SLURM installed, or when using OpenMPI `mpirun`, which does not contact SLURM. You can launch `erun` instead of running your application directly, using the `--program` option to specify the application name and arguments:
|
|
|
|
|
|
```
mpirun -n 4 /path/to/erun --program="hostname --alias"
```
|
|
|
|
|
|
In this example, `mpirun` would run 4 `erun` processes. Then, `erun` would launch the application `hostname` with its `--alias` parameter. You can use as many parameters as you want, but the quotes have to enclose all of them whenever there is more than just the program name. `erun` would simulate on the remote node both the local and remote pipelines for all created processes. It has an internal system to avoid repeating functions that are executed just once per job or node, like SLURM does with its plugins.
|
|
|
|
|
|
```
> erun --help
```

Also you have to load the EAR environment module or define its environment variables:

| Environment variable | Option |
| -------------------- | ------ |
| EAR_ETC=\<path\> | sysconfdir=\<path\> |
| EAR_DEFAULT=\<on/off\> | default=\<on/off\> |

Lastly, the typical SLURM parameters can be passed to `erun` in the same way they are written for `srun` or `sbatch`. For example:

```
mpirun -n 4 /path/to/erun --program="myapp" --ear-policy=monitoring --ear-verbose=2
```

> NOTE: If you are going to launch `n` applications with the `erun` command through an sbatch job, you must set the `SLURM_STEP_ID` environment variable to values from `0` to `n-1` before each `mpirun` call. This way `erun` reports the correct step ID to the EARD, which then stores it in the database.
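For example, an sbatch script launching two `erun` steps could set the step ID manually before each call (the application name and arguments are hypothetical):

```
export SLURM_STEP_ID=0
mpirun -n 4 /path/to/erun --program="myapp input1"

export SLURM_STEP_ID=1
mpirun -n 4 /path/to/erun --program="myapp input2"
```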
|
|
# Job accounting (eacct)
|
|
|
|
|
|
|
|
|
|
The `eacct` command shows the accounting information stored in the EAR DB for job (and step) IDs: the metrics that EAR monitors, such as execution time, average power, number of nodes and average frequency, among others. Some data will not be available if a job was not executed with EARL. The command uses EAR's configuration file to determine whether the user running it is privileged, since non-privileged users can only access their own information. The EAR module needs to be loaded to use the `eacct` command. It provides the following options:

```
|
Usage: eacct [Optional parameters]
|
|
|
|
Optional parameters:
|
|
|
|
-h displays this message
|
|
|
|
-v displays current EAR version
|
|
|
|
-b verbose mode for debugging purposes
|
|
|
|
-u specifies the user whose applications will be retrieved. Only available to privileged users. [default: all users]
|
|
|
|
-j specifies the job id and step id to retrieve with the format [jobid.stepid] or the format [jobid1,jobid2,...,jobid_n].
|
|
|
|
A user can only retrieve its own jobs unless said user is privileged. [default: all jobs]
|
|
|
|
-a specifies the application names that will be retrieved. [default: all app_ids]
|
|
|
|
-c specifies the file where the output will be stored in CSV format. [default: no file]
|
|
|
|
-t specifies the energy_tag of the jobs that will be retrieved. [default: all tags].
|
|
|
|
-l shows the information for each node for each job instead of the global statistics for said job.
|
|
|
|
-x shows the last EAR events. Nodes, job ids, and step ids can be specified as if were showing job information.
|
|
|
|
-m prints power signatures regardless of whether mpi signatures are available or not.
|
|
|
|
-r shows the EAR loop signatures. Nodes, job ids, and step ids can be specified as if were showing job information.
|
|
|
|
-n specifies the number of jobs to be shown, starting from the most recent one. [default: 20][to get all jobs use -n all]
|
|
|
|
-f specifies the file where the user-database can be found. If this option is used, the information will be read from the file and not the database.
|
|
|
|
```
|
|
|
|
|
|
|
|
## eacct usage examples
|
|
|
|
|
|
|
|
The basic usage of `eacct` retrieves the last 20 applications (by default) of the user executing it. If the user is privileged, they may see all users' applications. The default behaviour shows data from each job-step, aggregating the values from each node in said job-step. If using SLURM as a job manager, a sb (sbatch) job-step is created with the data from the entire execution. A specific job may be specified with the `-j` option:
|
|
|
|
- `[user@host EAR]$ eacct` -> Shows last 20 jobs (maximum) executed by the user.
|
|
|
|
- `[user@host EAR]$ eacct -j 175966` –> Shows data for jobid = 175966. Metrics are averaged per job.stepid.
|
|
|
|
- `[user@host EAR]$ eacct -j 175966.0` –> Shows data for jobid = 175966 stepid=0. Metrics are averaged per job.stepid.
|
|
|
|
- `[user@host EAR]$ eacct -j 175966,175967,175968` –> Shows data for jobid = 175966, 175967 and 175968. Metrics are averaged per job.stepid.
|
|
|
|
|
|
|
|
*eacct* shows a pre-selected set of columns. Some flags slightly modify the set of columns reported:
|
|
|
|
- JOB-STEP: JobID and Step ID. sb is shown for the sbatch.
|
|
|
|
- USER: Username who executed the job.
|
|
|
|
- APPLICATION: Job’s name, or executable name if the job name is not provided.
|
|
|
|
- POLICY: Energy optimization policy name (MO = Monitoring).
|
|
|
|
- NODES: Number of nodes which ran the job.
|
|
|
|
- AVG/DEF/IMC(GHz): Average CPU frequency, default frequency and average uncore (IMC) frequency, in GHz. Includes all the nodes for the step.
|
|
|
|
- TIME(s): Step execution time, in seconds.
|
|
|
|
- POWER: Average node power including all the nodes, in Watts.
|
|
|
|
- GBS: CPU Main memory bandwidth (GB/second). Hint for CPU/Memory bound classification.
|
|
|
|
- CPI: CPU Cycles per Instruction. Hint for CPU/Memory bound classification.
|
|
|
|
- ENERGY(J): Accumulated node energy. Includes all the nodes. In Joules.
|
|
|
|
- GFLOPS/WATT : CPU GFlops per Watt. Hint for energy efficiency.
|
|
|
|
- IO(MBs) : IO (read and write) Mega Bytes per second.
|
|
|
|
- MPI% : Percentage of MPI time over the total execution time. It’s the average including all the processes and nodes.
|
|
|
|
- GPU metrics:
  - G-POW (T/U): Average GPU power. Accumulated per node and averaged over all the nodes. T = Total (GPU power consumed even if the job's processes are not using the GPUs); U = GPUs used by the job.
  - G-FREQ: Average GPU frequency. Per node and averaged over all the nodes.
  - G-UTIL(G/MEM): GPU utilization and GPU memory utilization.
|
|
|
|
|
|
|
|
For node-specific information, the `-l` option provides detailed accounting of each individual node:
|
|
|
|
- `[user@host EAR]$ eacct -j 175966 -l` –> Shows per-node data for jobid=175966.
|
|
|
|
- `[user@host EAR]$ eacct -j 175966.0 -l` –> Shows per-node data for jobid=175966, stepid=0.
|
|
|
|
|
|
|
|
One additional column is shown: the VPI. The VPI is the percentage of AVX512 instructions over the total number of instructions.
|
|
|
|
|
|
|
|
For runtime data (EAR loops) one may retrieve them with `-r`. Both Job Id and Step Id filtering works:
|
|
|
|
- `[user@host EAR]$ eacct -j 175966.1 -r` –> shows metrics reported at runtime by the EAR library for jobid=175966 , stepid=1.
|
|
|
|
|
|
|
|
To easily transfer eacct’s output, `-c` option saves it in .csv format. Both aggregated and detailed accountings are available, as well as filtering:
|
|
|
|
- `[user@host EAR]$ eacct -j 175966 -c test.csv` –> adds to file test.csv all the metrics in EAR DB for jobid=175966. Metrics are averaged per application.
|
|
|
|
- `[user@host EAR]$ eacct -j 175966.1 -c -l test.csv` –> adds to file test.csv all the metrics in EAR DB for jobid=175966, stepid= 1. Metrics are per-node.
|
|
|
|
- `[user@host EAR]$ eacct -j 175966.1 -c -r test.csv` –> adds to file test.csv all the metrics in EAR DB for jobid=175966, stepid= 1. Metrics are per loop and node.
|
|
|
|
|
|
|
|
When using the `-c` option, all the metrics available in the EAR DB are reported.
|
|
|
|
|
|
|
|
# Jobs executed without the EAR library: Basic Job accounting
|
|
|
|
|
|
|
|
The EAR Library is automatically loaded with some programming models (MPI, MKL, OpenMP and CUDA). For applications executed without EARL loaded (for example, when `srun` is not used, or for programming models or applications not loaded by default by the EAR Library), EAR provides a default monitoring. In this case a subset of metrics will be reported. In particular:
|
|
|
|
- accumulated DC energy(J)
|
|
|
|
- accumulated DRAM energy(J)
|
|
|
|
- accumulated CPU PCK energy(J)
|
|
|
|
- EDP
|
|
|
|
- maximum DC power detected(W)
|
|
|
|
- minimum DC power detected(W)
|
|
|
|
- execution time (in sec)
|
|
|
|
- CPU average frequency (kHz)
|
|
|
|
- CPU default frequency(KHz).
|
|
|
|
|
|
|
|
DC node energy includes the CPU and GPU energy, if GPUs are present.

These metrics are reported per node, job ID and step ID, so they can be seen per job and per step when using `eacct`.
|
|
|