# Report plugins

The EAR reporting system is designed to fit any requirement to store all data collected by its components. To this end, EAR includes several report plug-ins that are used to send data to various services.

# Overview

* EARD/EARDBD: these plug-ins are used internally by EAR to send data between services, which in turn will aggregate it and send it to the configured databases or other services.
* MySQL/PostgreSQL: both plug-ins implement full EAR job and system accounting, each using the official C bindings to send the data to the database. For more information on the database structure, see [the corresponding section](EAR-Database.md).
* Prometheus: this plug-in exposes system monitoring data in OpenMetrics format, which is fully compatible with Prometheus. For information about how to compile and set it up, check the [Prometheus section](#prometheus-report-plugin).
* csv_ts: reports loop and application data to a CSV file. The structure is the same as `eacct`'s CSV option (see [`eacct`](EAR-commands#ear-job-accounting-eacct)) with an added column for the timestamp.
* EXAMON: sends application accounting and system metrics to EXAMON. For more information, see its [dedicated section](#examon).
* DCDB: sends application accounting and system metrics to DCDB. For more information, see its [dedicated section](#dcdb).
* sysfs: exposes system monitoring data through the file system. For more information, see its [dedicated section](#sysfs-report-plugin).

The reporting system is implemented by an internal API used by EAR components to report data at specific events/stages, and the report plug-in used by each component can be set in the `ear.conf` file.

The [Node Manager](Configuration#eard-configuration), the [Database Manager](Configuration#eardbd-configuration), the [Job Manager](Configuration#earl-configuration) and the [Global Manager](Configuration#eargm-configuration) are the configurable components.

The EAR Job Manager differs from the other components since it lets the user choose other plug-ins at job submission time. Check out how at the [Environment variables](EAR-environment-variables#ear_report_add) section.

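As a quick sketch, assuming the `EAR_REPORT_ADD` environment variable described in that section, with an illustrative plug-in and application name:

```
# Load an extra report plug-in for this job only (value syntax may differ;
# see the Environment variables section for the exact format).
export EAR_REPORT_ADD=csv_ts.so
srun ./my_app
```
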
Plug-ins are compiled as shared objects and are located at `$EAR_INSTALL_PATH/lib/plugins/report`.

Below is a list of the report plug-ins distributed with the official EAR software.

| Report plug-in name | Description |
| --- | --- |
| eard.so | Reports data to the EAR Node Manager. It is then up to the daemon to report the data as configured. This plug-in was mainly designed to be used by the EAR Job Manager. |
| eardbd.so | Reports data to the EAR Database Manager. It is then up to this service to report the data as configured. This plug-in was mainly designed to be used by the EAR Node Manager. |
| mysql.so | Reports data to a MySQL database using the official C bindings. This plug-in was first designed to be used by the EAR Database Manager. |
| psql.so | Reports data to a PostgreSQL database using the official C bindings. This plug-in was first designed to be used by the EAR Database Manager. |
| [prometheus.so](#prometheus-report-plugin) | Exposes system monitoring data in OpenMetrics format, which is fully compatible with Prometheus. |
| [examon.so](#examon) | Sends application accounting and system metrics to EXAMON. |
| [dcdb.so](#dcdb) | Sends application accounting and system metrics to DCDB. |
| [sysfs.so](#sysfs-report-plugin) | Exposes system monitoring data through the file system. |
| [csv\_ts.so](#csv) | Reports loop and application data to a CSV file. It is the report plug-in loaded when a user sets the [`--ear-user-db`](User-guide#ear-job-submission-flags) flag at submission time. |

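To check which report plug-ins your installation provides, you can simply list that directory; the output below is illustrative and depends on the features enabled at build time:

```
ls $EAR_INSTALL_PATH/lib/plugins/report
# csv_ts.so  dcdb.so  eard.so  eardbd.so  examon.so  mysql.so  prometheus.so  psql.so  sysfs.so
```
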
# Prometheus report plugin
## Requirements

The Prometheus plugin has only one dependency, [microhttpd](https://www.gnu.org/software/libmicrohttpd/). To be able to compile the plugin, make sure the library is in your LD\_LIBRARY\_PATH.

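For example, if microhttpd is installed under a non-standard prefix (the path below is illustrative):

```
export LD_LIBRARY_PATH=/opt/libmicrohttpd/lib:$LD_LIBRARY_PATH
```
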
## Installation

Currently, to compile and install the Prometheus plugin one has to run the following command:

```
make FEAT_DB_PROMETHEUS=1 install
```

With that, the plugin will be correctly placed in the usual folder.

## Configuration

Due to the way in which Prometheus works, this plugin is designed to be used by the EAR Daemons, although the EARDBD should not have many issues running it too. This will expose the metrics on each node on a small HTTP server.

In Prometheus, simply add the nodes you want to scrape to prometheus.yml with the port 9011. Make sure that the scrape interval is equal to or shorter than the insertion time (`NodeDaemonPowermonFreq` in `ear.conf`), since metrics only stay on the page for that duration.

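A minimal `prometheus.yml` scrape configuration sketch, assuming two compute nodes named `node001` and `node002` (node names and interval are illustrative):

```
scrape_configs:
  - job_name: "ear"
    # Keep the scrape interval equal to or shorter than NodeDaemonPowermonFreq.
    scrape_interval: 30s
    static_configs:
      - targets: ["node001:9011", "node002:9011"]
```
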
# Examon

ExaMon (Exascale Monitoring) is a lightweight monitoring framework for supporting accurate monitoring of power/energy/thermal and architectural parameters in distributed and large-scale high-performance computing installations.

## Compilation and installation

To compile the EXAMON plugin you need a functioning EXAMON installation.
Once that is set up, you can compile EAR normally and the plugin will be installed.

The plugin is designed to be used locally in each node (EARD level) together with EXAMON's data broker.

# DCDB

The Data Center Data Base (DCDB) is a modular, continuous, and holistic monitoring framework targeted at HPC environments.

This plugin implements the functions to report periodic metrics, report loops, and report events.

When the DCDB plugin is loaded, the EAR data collected for each report type is stored in a shared memory region, which is accessed by the DCDB EAR sensor (a report plugin implemented on the DCDB side) to collect the data and push it into the database using MQTT messages.

## Compilation and configuration

This plugin is automatically installed with the default EAR installation. To activate it, set it as one of the values in the `EARDReportPlugins` of `ear.conf` and restart the EARD.
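For instance, a sketch of the relevant `ear.conf` entry, assuming the node already reports through the EARDBD (the value list and separator shown are illustrative; check the `ear.conf` documentation for the exact syntax):

```
EARDReportPlugins=eardbd.so:dcdb.so
```
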
The plugin is designed to be used locally in each node (EARD level) with the DCDB collect agent.

# Sysfs Report Plugin

This report plugin writes EAR collected data into files. A single file is generated per metric, per job ID and step ID, per node, per island, per cluster. Only the last collected value of each metric is stored in the files, meaning that every time the report runs it saves the currently collected values by overwriting the previous data.

## Namespace Format

The following schema is used to create the metric files:

```
/root_directory/cluster/island/nodename/avg/metricFile
/root_directory/cluster/island/nodename/current/metricFile
/root_directory/cluster/island/jobs/jobID/stepID/nodename/avg/metricFile
/root_directory/cluster/island/jobs/jobID/stepID/nodename/current/metricFile
```

The root\_directory is the default path where all the created metric files are generated.

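For example, the node-level average DC power metric would live in a file such as the following (root directory, cluster, island and node names are illustrative):

```
/var/ear/sysfs/clusterA/island0/node001/avg/dc_power_watt
```
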
The cluster, island and nodename components will be replaced by the cluster name, island number, and node information, respectively.

`metricFile` will be replaced by the name of the metric collected by EAR.

## Metric File Naming Format

The naming format used to create the metric files implements the standard sysfs interface format. The commonly used file naming schema is `<type>_<component>_<metric-name>_<unit>`.

Numbering is used in some metric files when the component has more than one instance, such as FLOPS counters or GPU data.

Examples of some generated metric files:

- dc\_power\_watt
- app\_sig\_pck\_power\_watt
- app\_sig\_mem\_gbs
- app\_sig\_flops\_6
- avg\_imc\_freq\_KHz

## Metrics reported

The following are the reported values for each type of metric recorded by EAR:

- report_periodic_metrics
  - Average values
    - The frequency and temperature values are calculated by summing the values of all periods since the report plug-in was loaded up to the current period, and dividing by the total number of periods.
    - The energy value is the accumulated value of all the periods since the report plug-in was loaded up to the current one.
    - The path to those metric files is built as: /root\_directory/cluster/island/nodename/avg/metricFile
  - Current values
    - Represent the current collected EAR metric per event.
    - The path to those metric files is built as: /root\_directory/cluster/island/jobs/jobID/stepID/nodename/current/metricFile

> Note: If the cluster contains GPUs, both report_loops and report_applications will generate new schema files per GPU, which contain all the collected data for each GPU, with the paths below:
> - /root\_directory/cluster/island/jobs/jobID/stepID/nodename/current/GPU-ID/metricFile
> - /root\_directory/cluster/island/jobs/jobID/stepID/nodename/avg/GPU-ID/metricFile

# CSV

This plug-in reports both application and loop signatures in CSV format. Note that the latter can only be reported if the application is running with the EAR Job Manager. Fields are separated by semi-colons (i.e., _;_). This plug-in is the one loaded by default when a user sets the [`--ear-user-db`](User-guide#ear-job-submission-flags) submission flag.

By default, output files are named `ear_app_log.<nodename>.time.csv` and `ear_app_log.<nodename>.time.loops.csv` for applications and loops, respectively. This behaviour can be changed by exporting the `EAR_USER_DB_PATHNAME` environment variable, in which case output files are `<env var value>.<nodename>.time.csv` for application signatures and `<env var value>.<nodename>.time.loops.csv` for loop signatures.

> When setting the `--ear-user-db=something` flag at submission time, the batch scheduler plug-in sets this environment variable for you.
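For example, setting the variable manually (the path is illustrative):

```
export EAR_USER_DB_PATHNAME=/home/user/ear/myrun
# Output files will then be named:
#   /home/user/ear/myrun.<nodename>.time.csv        (application signatures)
#   /home/user/ear/myrun.<nodename>.time.loops.csv  (loop signatures)
```
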
The following table describes **application signature file fields**:
| Field | Description | Format |
| --- | --- | --- |
| NODENAME | The short node name the following signature belongs to. | string |
| JOBID | The Job ID the following signature belongs to. | integer |
| STEPID | The Step ID the following signature belongs to. | integer |
| APPID | The Application ID the following signature belongs to. | integer |
| USERID | The user owning the application. | string |
| GROUPID | The main group the user owning the application belongs to. | string |
| JOBNAME | The name of the application being run. In SLURM systems, this value honours the `SLURM_JOB_NAME` environment variable. Otherwise, it is the executable program name. | string |
| USER\_ACC | The account of the user who ran the application. Only supported in SLURM systems. | string |
| ENERGY\_TAG | The energy tag requested with the application (see `ear.conf`). | string |
| POLICY | The Job Manager optimization policy executed (if applicable). | string |
| POLICY\_TH | The power policy threshold used (if applicable). | real |
| START\_TIME | The timestamp of the beginning of the application, expressed in seconds since EPOCH. | integer |
| END\_TIME | The timestamp of the application ending, expressed in seconds since EPOCH. | integer |
| START\_DATE | The date of the beginning of the application, expressed as %+4Y-%m-%d %X. | string |
| END\_DATE | The date of the application ending, expressed as %+4Y-%m-%d %X. | string |
| AVG\_CPUFREQ\_KHZ | The average CPU frequency across all CPUs used by the application, in kHz. | integer |
| AVG\_IMCFREQ\_KHZ | The average IMC frequency during the application execution, in kHz. | integer |
| DEF\_FREQ\_KHZ | The default CPU frequency set at the start of the application, in kHz. | integer |
| TIME\_SEC | The total execution time of the application, in seconds. | integer |
| CPI | The Cycles per Instruction retrieved across all application processes. | real |
| TPI | The Transactions to the main memory per Instruction retrieved. | real |
| MEM\_GBS | The memory bandwidth of the application, in GB/s. | real |
| IO\_MBS | The accumulated I/O bandwidth of the application processes, in MB/s. | real |
| PERC\_MPI | The average percentage of time spent in MPI calls across all application processes, in %. | real |
| DC\_NODE\_POWER\_W | The average DC node power consumed by the application in the node, in Watts. | real |
| DRAM\_POWER\_W | The average DRAM power consumed by the application in the node, in Watts. | real |
| PCK\_POWER\_W | The average package power consumed by the application in the node, in Watts. | real |
| CYCLES | The total cycles consumed by the application, accumulated across all its processes. | integer |
| INSTRUCTIONS | The total number of instructions retrieved, accumulated across all its processes. | integer |
| CPU-GFLOPS | The total number of GFLOPS retrieved, accumulated across all its processes. | real |
| GPU*i*\_POWER\_W | The average power consumption of the *i*th GPU in the node. | real |
| GPU*i*\_FREQ\_KHZ | The average frequency of the *i*th GPU in the node. | real |
| GPU*i*\_MEM\_FREQ\_KHZ | The average memory frequency of the *i*th GPU in the node. | real |
| GPU*i*\_UTIL\_PERC | The average GPU *i* utilization. | integer |
| GPU*i*\_MEM\_UTIL\_PERC | The average GPU *i* memory utilization. | integer |
| GPU*i*\_GFLOPS | The total GPU *i* GFLOPS retrieved during the application execution. | real |
| GPU*i*\_TEMP | The average temperature of the *i*th GPU of the node, in Celsius. | real |
| GPU*i*\_MEMTEMP | The average memory temperature of the *i*th GPU of the node, in Celsius. | real |
| L1\_MISSES | The total number of L1 cache misses during the application execution. | integer |
| L2\_MISSES | The total number of L2 cache misses during the application execution. | integer |
| L3\_MISSES | The total number of L3 cache misses during the application execution. | integer |
| SPOPS\_SINGLE | The total number of single precision floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| SPOPS\_128 | The total number of single precision AVX128 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| SPOPS\_256 | The total number of single precision AVX256 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| SPOPS\_512 | The total number of single precision AVX512 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| DPOPS\_SINGLE | The total number of double precision floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| DPOPS\_128 | The total number of double precision AVX128 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| DPOPS\_256 | The total number of double precision AVX256 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| DPOPS\_512 | The total number of double precision AVX512 floating point operations, accumulated across all processes, retrieved during the application execution. | integer |
| TEMP*i* | The average temperature of socket *i* during the application execution, in Celsius. | real |
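
Since the output is plain semicolon-separated CSV, it is straightforward to post-process. Below is a minimal Python sketch, assuming the header row uses the field names from the table above; the file name is illustrative:

```
import csv

# Illustrative file name; real files follow the ear_app_log.<nodename>.time.csv
# pattern (or <EAR_USER_DB_PATHNAME>.<nodename>.time.csv when the variable is set).
with open("ear_app_log.node001.time.csv", newline="") as f:
    for row in csv.DictReader(f, delimiter=";"):
        # Print a few fields of each application signature.
        print(row["JOBID"], row["STEPID"], row["TIME_SEC"], row["DC_NODE_POWER_W"])
```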