Report · Changes

v5.2 release authored Oct 23, 2025 by Oriol
Report.md
View page @ 168bb2b7
...@@ -21,7 +21,8 @@ Below there is a list of the report plug-ins distributed with the official EAR s ...@@ -21,7 +21,8 @@ Below there is a list of the report plug-ins distributed with the official EAR s
| [examon.so](#examon) | Sends application accounting and system metrics to EXAMON. | | [examon.so](#examon) | Sends application accounting and system metrics to EXAMON. |
| [dcdb.so](#dcdb) | Sends application accounting and system metrics to DCDB. | | [dcdb.so](#dcdb) | Sends application accounting and system metrics to DCDB. |
| [sysfs.so](#sysfs-report-plugin) | Exposes system monitoring data through the file system. | | [sysfs.so](#sysfs-report-plugin) | Exposes system monitoring data through the file system. |
| [csv\_ts.so](#csv) | Reports loop and application data to a CSV file. It is the report plug-in loaded when a user sets the [`--ear-user-db`](https://gitlab.bsc.es/ear_team/ear/-/wikis/User-guide#ear-job-submission-flags) flag at submission time. | | [csv\_ts.so](#csv) | Reports loop and application data to a CSV file. It is the report plug-in loaded when a user sets the [`--ear-user-db`](https://gitlab.bsc.es/ear_team/ear_private/-/wikis/User-guide#ear-job-submission-flags) flag at submission time. |
| [dcgmi.so](#dcgmi) | Reports loop and application data to a CSV file. It differs from the [csv\_ts.so](#csv) plug-in in that it also reports NVIDIA DCGM metrics collected by the EAR Library. |
# Prometheus report plugin # Prometheus report plugin
...@@ -177,21 +178,28 @@ The following table describes **application signature file fields**: ...@@ -177,21 +178,28 @@ The following table describes **application signature file fields**:
| Field | Description | Format | | Field | Description | Format |
| --- | --- | --- | | --- | --- | --- |
| NODENAME | The short node name the following signature belongs to. | string |
| JOBID | The Job ID the following signature belongs to. | integer | | JOBID | The Job ID the following signature belongs to. | integer |
| STEPID | The Step ID the following signature belongs to. | integer | | STEPID | The Step ID the following signature belongs to. | integer |
| APPID | The Application ID the following signature belongs to. | integer | | APPID | The Application ID the following signature belongs to. | integer |
| USERID | The user owning the application. | string | | USERID | The user owning the application. | string |
| GROUPID | The main group the user owning the application belongs to. | string | | GROUPID | The main group the user owning the application belongs to. | string |
| ACCOUNTID | This is the account of the user which ran the application. Only supported in SLURM systems. | string |
| JOBNAME | The name of the application being run. In SLURM systems, this value honours the `SLURM_JOB_NAME` environment variable. Otherwise, it is the executable program name. | string | | JOBNAME | The name of the application being run. In SLURM systems, this value honours the `SLURM_JOB_NAME` environment variable. Otherwise, it is the executable program name. | string |
| USER\_ACC | This is the account of the user which ran the application. Only supported in SLURM systems. | string |
| ENERGY\_TAG | The energy tag requested with the application (see `ear.conf`). | string | | ENERGY\_TAG | The energy tag requested with the application (see `ear.conf`). | string |
| POLICY | The Job Manager optimization policy executed (if applies). | string | | JOB\_START\_TIME | The timestamp of the beginning of the application, expressed in seconds since EPOCH. | integer |
| POLICY\_TH | The power policy threshold used (if applies). | real | | JOB\_END\_TIME | The timestamp of the application ending, expressed in seconds since EPOCH. | integer |
| START\_TIME | The timestamp of the beginning of the application, expressed in seconds since EPOCH. | integer | | JOB\_EARL\_START\_TIME | The timestamp of the beginning of the application monitored by the EARL, expressed in seconds since EPOCH. | integer |
| END\_TIME | The timestamp of the application ending, expressed in seconds since EPOCH. | integer | | JOB\_EARL\_END\_TIME | The timestamp of the application ending reported by the EARL, expressed in seconds since EPOCH. | integer |
| START\_DATE | The date of the beginning of the application, expressed in %+4Y-%m-%d %X. | string | | START\_DATE | The date of the beginning of the application, expressed in %+4Y-%m-%d %X. | string |
| END\_DATE | The date of the application ending, expressed in %+4Y-%m-%d %X. | string | | END\_DATE | The date of the application ending, expressed in %+4Y-%m-%d %X. | string |
| POLICY | The Job Manager optimization policy executed (if applies). | string |
| POLICY\_TH | The power policy threshold used (if applies). | real |
| JOB\_NPROCS | The number of processes involved in the application. | integer |
| JOB\_TYPE | The job type. | integer |
| JOB\_DEF\_FREQ | The default frequency at which the job started. | integer |
| EARL\_ENABLED | Indicates whether the job-step ran with the EARL enabled. | integer |
| EAR\_LEARNING | Whether the application was run in the [learning phase](Learning-phase). |
| NODENAME | The short node name the following signature belongs to. | string |
| AVG\_CPUFREQ\_KHZ | The average CPU frequency across all CPUs used by the application, in kHz. | integer | | AVG\_CPUFREQ\_KHZ | The average CPU frequency across all CPUs used by the application, in kHz. | integer |
| AVG\_IMCFREQ\_KHZ | The average IMC frequency during the application execution, in kHz. | integer | | AVG\_IMCFREQ\_KHZ | The average IMC frequency during the application execution, in kHz. | integer |
| DEF\_FREQ\_KHZ | The default CPU frequency set at the start of the application, in kHz. | integer | | DEF\_FREQ\_KHZ | The default CPU frequency set at the start of the application, in kHz. | integer |
...@@ -227,3 +235,55 @@ The following table describes **application signature file fields**: ...@@ -227,3 +235,55 @@ The following table describes **application signature file fields**:
| DPOPS\_256 | The total number of double precision AVX256 floating point operations, accumulated across all processes, retrieved during the application execution. | integer | | DPOPS\_256 | The total number of double precision AVX256 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| DPOPS\_512 | The total number of double precision AVX512 floating point operations, accumulated across all processes, retrieved during the application execution. | integer | | DPOPS\_512 | The total number of double precision AVX512 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| TEMP*i* | The average temperature of the socket *i* during the application execution, in degrees Celsius. | real | | TEMP*i* | The average temperature of the socket *i* during the application execution, in degrees Celsius. | real |
| NODEMGR\_DC\_NODE\_POWER\_W | Average node power over the time period, in Watts. This value differs from *DC_NODE_POWER_W* in that it is computed and reported by the [Node Manager](Architecture#ear-node-manager) (the EARD) regardless of whether the EARL was enabled. | real |
| NODEMGR\_DRAM\_POWER\_W | Average DRAM power over the time period, in Watts. **Not available on AMD sockets**. This value differs from *DRAM_POWER_W* in that it is computed and reported by the [Node Manager](Architecture#ear-node-manager) (the EARD) regardless of whether the EARL was enabled. | real |
| NODEMGR\_PCK\_POWER\_W | Average RAPL package power over the time period, in Watts. This value shows the aggregated power of all sockets in a package. It differs from *PCK_POWER_W* in that it is computed and reported by the [Node Manager](Architecture#ear-node-manager) (the EARD) regardless of whether the EARL was enabled. | real |
| NODEMGR\_MAX\_DC\_POWER\_W | The peak DC node power computed by the Node Manager. | real |
| NODEMGR\_MIN\_DC\_POWER\_W | The minimum DC node power computed by the Node Manager. | real |
| NODEMGR\_TIME\_SEC | The execution time period, in seconds, covered by the job-step metrics reported by the Node Manager. | real |
| NODEMGR\_AVG\_CPUFREQ\_KHZ | The average CPU frequency computed by the Node Manager during the job-step execution time, in kHz. | real |
| NODEMGR\_DEF\_FREQ\_KHZ | The default frequency set by the Node Manager when the job-step began, in kHz. | real |
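
As a rough sketch of how the Node Manager fields above could be compared with their EARL counterparts, the shell function below extracts a single column from a report CSV file by its header name. The function name `csv_col`, the file name `app.csv` used in the note afterwards, and the `;` field separator are assumptions made for illustration; pass a different separator if the file produced on your system uses another one.

```sh
# csv_col FILE COLUMN [SEP]
# Print the values of the column named COLUMN, one per line (header skipped).
# SEP defaults to ';' (an assumption; override it if your file differs).
csv_col() {
    awk -F "${3:-;}" -v col="$2" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { print $idx }
    ' "$1"
}
```

For instance, `csv_col app.csv NODEMGR_DC_NODE_POWER_W` and `csv_col app.csv DC_NODE_POWER_W` print the two power columns so they can be inspected side by side. The function returns a non-zero status when the column is not present in the header.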
# DCGMI
This plug-in reports the same metrics as the [CSV](#csv) plug-in.
Additionally, it reports [NVIDIA DCGM profiling metrics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics) for those NVIDIA GPU devices which support them.
> Since [ear-v5.0](CHANGELOG#ear-50), the EAR Library supports collecting and reporting NVIDIA DCGM profiling metrics for [Ampere](https://www.nvidia.com/en-us/data-center/ampere-architecture/) and [Hopper](https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/) devices. [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) should be supported as well.
Apart from loading the report plug-in, i.e., `export EAR_REPORT_ADD=dcgmi.so`, the EAR Library must have the DCGM monitoring enabled.
This feature is enabled by default unless explicitly disabled at compile time.
If disabled, you can enable it by setting the `EAR_GPU_DCGMI_ENABLED` environment variable to *1*:
```sh
...
export EAR_GPU_DCGMI_ENABLED=1
export EAR_REPORT_ADD=dcgmi.so
srun --ear=on my_app
```
The table below describes the fields reported in the CSV file generated by this plug-in.
Please review [the official documentation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#metrics) for more information about the definition of each metric.
By default, **EAR collects only a subset of the DCGM metrics** (see the table below).
In order to collect all of them, set the `EAR_DCGM_ALL_EVENTS` environment variable to 1.
See the full list of supported metrics:
| Field | Description | Format |
| --- | --- | --- |
| DCGMI\_EVENTS\_COUNT | The number of fields related to DCGM metrics. | integer |
| GPU*i*\_gr\_engine\_active (\*) | Graphics Engine Activity. | real |
| GPU*i*\_sm\_active (\*) | SM Activity. | real |
| GPU*i*\_sm\_occupancy (\*) | SM Occupancy. | real |
| GPU*i*\_tensor\_active | Tensor Activity. | real |
| GPU*i*\_dram\_active | Memory BW Utilization. | real |
| GPU*i*\_fp64\_active | FP64 Engine Activity. | real |
| GPU*i*\_fp32\_active | FP32 Engine Activity. | real |
| GPU*i*\_fp16\_active | FP16 Engine Activity. | real |
| GPU*i*\_pcie\_tx\_bytes (\*) | PCIe Bandwidth (writes). | real |
| GPU*i*\_pcie\_rx\_bytes (\*) | PCIe Bandwidth (reads). | real |
| GPU*i*\_nvlink\_tx\_bytes (\*) | NVLink Bandwidth (writes). | real |
| GPU*i*\_nvlink\_rx\_bytes (\*) | NVLink Bandwidth (reads). | real |
> \* This metric needs to be requested explicitly through `export EAR_DCGM_ALL_EVENTS=1`.
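
To check which DCGM metrics actually ended up in a report file, for instance with and without `EAR_DCGM_ALL_EVENTS=1`, one option is to list the header fields that match the metric names in the table above. This is only a sketch: the function name `dcgm_columns` is hypothetical, and it assumes a single-character `;` separator and header fields spelled exactly as `GPU<i>_<metric>` (for example `GPU0_sm_active`), which may differ in the file your installation produces.

```sh
# dcgm_columns FILE [SEP]
# Print the header fields of FILE that match a DCGM metric name, one per line.
# SEP must be a single character and defaults to ';' (an assumption).
dcgm_columns() {
    head -n 1 "$1" | tr "${2:-;}" '\n' |
        grep -E '^GPU[0-9]+_(gr_engine_active|sm_active|sm_occupancy|tensor_active|dram_active|fp(64|32|16)_active|(pcie|nvlink)_(tx|rx)_bytes)$'
}
```

Running it on a file written with the default settings and on one written with `EAR_DCGM_ALL_EVENTS=1` should show whether the additional fields marked with \* are present. The function returns a non-zero status when no DCGM field is found.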
\ No newline at end of file