| ... | ... | @@ -21,7 +21,8 @@ Below there is a list of the report plug-ins distributed with the official EAR s |
|
|
|
| [examon.so](#examon) | Sends application accounting and system metrics to EXAMON. |
|
|
|
|
| [dcdb.so](#dcdb) | Sends application accounting and system metrics to DCDB. |
|
|
|
|
| [sysfs.so](#sysfs-report-plugin) | Exposes system monitoring data through the file system. |
|
|
|
|
| [csv\_ts.so](#csv) | Reports loop and application data to a CSV file. It is the report plug-in loaded when a user sets the [`--ear-user-db`](https://gitlab.bsc.es/ear_team/ear/-/wikis/User-guide#ear-job-submission-flags) flag at submission time. |
|
|
|
|
| [csv\_ts.so](#csv) | Reports loop and application data to a CSV file. It is the report plug-in loaded when a user sets the [`--ear-user-db`](https://gitlab.bsc.es/ear_team/ear_private/-/wikis/User-guide#ear-job-submission-flags) flag at submission time. |
|
|
|
|
| [dcgmi.so](#dcgmi) | Reports loop and application data to a CSV file. It differs from the [csv\_ts.so](#csv) plug-in in that it also reports NVIDIA DCGM metrics collected by the EAR Library. |
|
|
|
|
|
|
|
|
# Prometheus report plugin
|
|
|
|
|
| ... | ... | @@ -177,21 +178,28 @@ The following table describes **application signature file fields**: |
|
|
|
|
|
|
|
| Field | Description | Format |
|
|
|
|
| --- | --- | --- |
|
|
|
|
| NODENAME | The short node name the following signature belongs to. | string |
|
|
|
|
| JOBID | The Job ID the following signature belongs to. | integer |
|
|
|
|
| STEPID | The Step ID the following signature belongs to. | integer |
|
|
|
|
| APPID | The Application ID the following signature belongs to. | integer |
|
|
|
|
| USERID | The user owning the application. | string |
|
|
|
|
| GROUPID | The main group the user owning the application belongs to. | string |
|
|
|
|
| ACCOUNTID | The account of the user who ran the application. Only supported on SLURM systems. | string |
|
|
|
|
| JOBNAME | The name of the application being run. In SLURM systems, this value honours the `SLURM_JOB_NAME` environment variable. Otherwise, it is the executable program name. | string |
|
|
|
|
| USER\_ACC | The account of the user who ran the application. Only supported on SLURM systems. | string |
|
|
|
|
| ENERGY\_TAG | The energy tag requested with the application (see `ear.conf`). | string |
|
|
|
|
| POLICY | The Job Manager optimization policy executed (if applies). | string |
|
|
|
|
| POLICY\_TH | The power policy threshold used (if applies). | real |
|
|
|
|
| START\_TIME | The timestamp of the beginning of the application, expressed in seconds since EPOCH. | integer |
|
|
|
|
| END\_TIME | The timestamp of the application ending, expressed in seconds since EPOCH. | integer |
|
|
|
|
| JOB\_START\_TIME | The timestamp of the beginning of the application, expressed in seconds since EPOCH. | integer |
|
|
|
|
| JOB\_END\_TIME | The timestamp of the application ending, expressed in seconds since EPOCH. | integer |
|
|
|
|
| JOB\_EARL\_START\_TIME | The timestamp of the beginning of the application monitored by the EARL, expressed in seconds since EPOCH. | integer |
|
|
|
|
| JOB\_EARL\_END\_TIME | The timestamp of the application ending reported by the EARL, expressed in seconds since EPOCH. | integer |
|
|
|
|
| START\_DATE | The date of the beginning of the application, expressed in `%+4Y-%m-%d %X` format. | string |
|
|
|
|
| END\_DATE | The date of the application ending, expressed in `%+4Y-%m-%d %X` format. | string |
|
|
|
|
| POLICY | The Job Manager optimization policy executed (if applies). | string |
|
|
|
|
| POLICY\_TH | The power policy threshold used (if applies). | real |
|
|
|
|
| JOB\_NPROCS | The number of processes involved in the application. | integer |
|
|
|
|
| JOB\_TYPE | The job type. | integer |
|
|
|
|
| JOB\_DEF\_FREQ | The default frequency at which the job started. | integer |
|
|
|
|
| EARL\_ENABLED | Indicates whether the job-step ran with the EARL enabled. | integer |
|
|
|
|
| EAR\_LEARNING | Whether the application was run in the [learning phase](Learning-phase). |
|
|
|
|
| NODENAME | The short node name the following signature belongs to. | string |
|
|
|
|
| AVG\_CPUFREQ\_KHZ | The average CPU frequency across all CPUs used by the application, in kHz. | integer |
|
|
|
|
| AVG\_IMCFREQ\_KHZ | The average IMC frequency during the application execution, in kHz. | integer |
|
|
|
|
| DEF\_FREQ\_KHZ | The default CPU frequency set at the start of the application, in kHz. | integer |
|
| ... | ... | @@ -227,3 +235,55 @@ The following table describes **application signature file fields**: |
|
|
|
| DPOPS\_256 | The total number of double precision AVX256 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
|
|
|
|
| DPOPS\_512 | The total number of double precision AVX512 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
|
|
|
|
| TEMP*i* | The average temperature of socket *i* during the application execution, in degrees Celsius. | real |
|
|
|
|
| NODEMGR\_DC\_NODE\_POWER\_W | Average node power over the time period, in Watts. This value differs from *DC_NODE_POWER_W* in that it is computed and reported by the [Node Manager](Architecture#ear-node-manager) (the EARD) regardless of whether the EARL was enabled. | real |
|
|
|
|
| NODEMGR\_DRAM\_POWER\_W | Average DRAM power over the time period, in Watts. **Not available on AMD sockets**. This value differs from *DRAM_POWER_W* in that it is computed and reported by the [Node Manager](Architecture#ear-node-manager) (the EARD) regardless of whether the EARL was enabled. | real |
|
|
|
|
| NODEMGR\_PCK\_POWER\_W | Average RAPL package power over the time period, in Watts. This value shows the aggregated power of all sockets in a package. It differs from *PCK_POWER_W* in that it is computed and reported by the [Node Manager](Architecture#ear-node-manager) (the EARD) regardless of whether the EARL was enabled. | real |
|
|
|
|
| NODEMGR\_MAX\_DC\_POWER\_W | The peak DC node power computed by the Node Manager. | real |
|
|
|
|
| NODEMGR\_MIN\_DC\_POWER\_W | The minimum DC node power computed by the Node Manager. | real |
|
|
|
|
| NODEMGR\_TIME\_SEC | The execution time period, in seconds, covered by the job-step metrics reported by the Node Manager. | real |
|
|
|
|
| NODEMGR\_AVG\_CPUFREQ\_KHZ | The average CPU frequency computed by the Node Manager during the job-step execution time. | real |
|
|
|
|
| NODEMGR\_DEF\_FREQ\_KHZ | The default frequency set by the Node Manager when the job-step began. | real |
|
|
|
|
|
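Since the Node Manager fields report an average power and the time period it covers, an energy estimate follows from multiplying the two (energy in Joules = average power in Watts × time in seconds). The snippet below is a minimal sketch of that arithmetic; `energy_joules` is a hypothetical helper, not part of EAR, and the input values are made up for illustration.

```sh
# Hypothetical helper (not provided by EAR): estimate energy in Joules from
# an average power (e.g. NODEMGR_DC_NODE_POWER_W) and a time period
# (e.g. NODEMGR_TIME_SEC).
energy_joules() {
    # $1: average power in Watts, $2: time period in seconds
    LC_ALL=C awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", p * t }'
}

# Illustrative values: 350.5 W sustained over 120 s.
energy_joules 350.5 120   # prints 42060.0
```

The same helper could be applied to *NODEMGR_DRAM_POWER_W* or *NODEMGR_PCK_POWER_W* to estimate the energy of those domains over the same period.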
|
|
|
# DCGMI
|
|
|
|
|
|
|
|
This plug-in reports the same metrics as the [CSV](#csv) plug-in.
Additionally, it reports [NVIDIA DCGM profiling metrics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics) for those NVIDIA GPU devices which support them.
|
|
|
|
|
|
|
|
> Since [ear-v5.0](CHANGELOG#ear-50), the EAR Library supports collecting and reporting NVIDIA DCGM profiling metrics for [Ampere](https://www.nvidia.com/en-us/data-center/ampere-architecture/) and [Hopper](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) devices. [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) should be supported as well.
|
|
|
|
|
|
|
|
Apart from loading the report plug-in, i.e., `export EAR_REPORT_ADD=dcgmi.so`, the EAR Library must have DCGM monitoring enabled.
This feature is enabled by default unless explicitly disabled at compile time.
If disabled, you can enable it by setting the `EAR_GPU_DCGMI_ENABLED` environment variable to *1*:
|
|
|
|
|
|
|
|
```sh
...

export EAR_GPU_DCGMI_ENABLED=1
export EAR_REPORT_ADD=dcgmi.so
srun --ear=on my_app
```
|
|
|
|
|
|
|
|
The table below describes the fields reported in the CSV file generated by this plug-in.
Please review [the official documentation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics) for more information about the definition of each metric.
|
|
|
|
|
|
|
|
By default, **EAR collects only a subset of the DCGM metrics** (see the table below).
To collect all of them, set the `EAR_DCGM_ALL_EVENTS` environment variable to 1.
See the full list of supported metrics:
|
|
|
|
|
|
|
|
| Field | Description | Format |
| --- | --- | --- |
| DCGMI\_EVENTS\_COUNT | The number of fields related to DCGM metrics. | integer |
| GPU*i*\_gr\_engine\_active (\*) | Graphics Engine Activity. | real |
| GPU*i*\_sm\_active (\*) | SM Activity. | real |
| GPU*i*\_sm\_occupancy (\*) | SM Occupancy. | real |
| GPU*i*\_tensor\_active | Tensor Activity. | real |
| GPU*i*\_dram\_active | Memory BW Utilization. | real |
| GPU*i*\_fp64\_active | FP64 Engine Activity. | real |
| GPU*i*\_fp32\_active | FP32 Engine Activity. | real |
| GPU*i*\_fp16\_active | FP16 Engine Activity. | real |
| GPU*i*\_pcie\_tx\_bytes (\*) | PCIe Bandwidth (writes). | real |
| GPU*i*\_pcie\_rx\_bytes (\*) | PCIe Bandwidth (reads). | real |
| GPU*i*\_nvlink\_tx\_bytes (\*) | NVLink Bandwidth (writes). | real |
| GPU*i*\_nvlink\_rx\_bytes (\*) | NVLink Bandwidth (reads). | real |
|
|
|
|
|
|
|
|
> \* This metric needs to be requested explicitly through `export EAR_DCGM_ALL_EVENTS=1`.
|
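To check which of the DCGM fields listed above ended up in a generated file, one option is to inspect the CSV header. The following is a minimal sketch, not part of EAR: `dcgm_columns` is a hypothetical helper, the `;` field separator is an assumption (pass another one as the second argument if your files differ), and the file name in the usage comment is illustrative.

```sh
# Hypothetical helper (not provided by EAR): list the DCGM-related column
# names found in the header (first line) of a CSV file.
# $1: CSV file path
# $2: field separator (assumption: ';' when omitted)
dcgm_columns() {
    sep="${2:-;}"
    metrics='gr_engine_active|sm_active|sm_occupancy|tensor_active|dram_active'
    metrics="$metrics|fp64_active|fp32_active|fp16_active"
    metrics="$metrics|pcie_tx_bytes|pcie_rx_bytes|nvlink_tx_bytes|nvlink_rx_bytes"
    # Split the header into one column name per line and keep only the
    # fields described in the table above.
    head -n 1 "$1" | tr "$sep" '\n' |
        grep -E "^(DCGMI_EVENTS_COUNT|GPU[0-9]+_($metrics))$"
}

# Usage (illustrative file name):
#   dcgm_columns my_app.csv
```

Matching the metric names explicitly, rather than any `GPU<i>_` prefix, avoids picking up other per-GPU columns the file may contain.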
|
\ No newline at end of file |