[[_TOC_]]

## EAR Node Manager

The EAR Daemon (EARD) is a per-node process that provides privileged metrics of each node as well as a periodic power monitoring service.
These periodic power metrics can be sent to EAR's database directly, via the EAR Database Daemon (EARDBD), or by using one of the provided [report plug-ins](Report).

See the [EARDBD](#ear-database-manager) section and the [configuration page](Configuration) for more information about the EAR Database Manager and how to configure the EARD to send its collected data to it.

### Overview

The node daemon is the component in charge of providing any kind of service that requires privileged capabilities. The current version is conceived as an external process executed with root privileges.

The EARD provides the following services, each one covered by one thread:

- Provides privileged metrics to EARL such as the average frequency, uncore integrated memory controller counters to compute the memory bandwidth, as well as energy metrics (DC node, DRAM and package energy).
- Implements a periodic power monitoring service. This service allows the EAR package to control the total energy consumed in the system.
- Offers a remote API used by EARplug, EARGM and EAR commands. This API accepts requests such as getting the system status, changing policy settings or notifying new job/end job events.

### Requirements

If the EAR Database is used as the storage target, EARD connects to the [EARDBD](#ear-database-manager) service, which has to be up before starting the node daemon. Otherwise, the values reported by EARD to be stored in the database will be lost.

### Configuration

The EAR Daemon is configured through the `$(EAR_ETC)/ear/ear.conf` file.
It can be dynamically reconfigured by reloading the service.

Please visit the [EAR configuration file page](Configuration#EARD-configuration) for more information about the options of EARD and other components.

### Execution

To execute this component, these `systemctl` command examples are provided:

- `sudo systemctl start eard` to start the EARD service.
- `sudo systemctl stop eard` to stop the EARD service.
- `sudo systemctl reload eard` to force reloading the configuration of the EARD service.

Log messages are generated during execution. Use the `journalctl` command to see the EARD messages:

- `sudo journalctl -u eard -f`

### Reconfiguration

After executing a `systemctl reload eard` command, not all the EARD options are dynamically updated. The variables that are updated are:

```
DefaultPstates
NodeDaemonMaxPstate
NodeDaemonVerbose
NodeDaemonPowermonFreq
SupportedPolicies
MinTimePerformanceAccuracy
```

To reconfigure other options, such as the EARD connection port or the coefficients, the EARD must be stopped and restarted.
Visit the [EAR configuration file page](Configuration#EARD-configuration) for more information about the options of EARD and other components.

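
The reload rule in this section can be sketched as a small lookup. This is only an illustration: the function name `eard_option_is_reloadable` and the array are hypothetical and are not part of EAR; the option names are the ones listed in this section.

```c
#include <stddef.h>
#include <string.h>

/* Option names that this section lists as updated on a reload. */
static const char *const eard_reloadable_options[] = {
    "DefaultPstates",
    "NodeDaemonMaxPstate",
    "NodeDaemonVerbose",
    "NodeDaemonPowermonFreq",
    "SupportedPolicies",
    "MinTimePerformanceAccuracy",
};

/* Hypothetical helper: returns 1 when `systemctl reload eard` is enough to
 * apply a change to the option, 0 when the EARD must be restarted. */
int eard_option_is_reloadable(const char *name)
{
    size_t n = sizeof(eard_reloadable_options) / sizeof(eard_reloadable_options[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, eard_reloadable_options[i]) == 0)
            return 1;
    }
    return 0;
}
```

A deployment script could use such a check to decide between `systemctl reload eard` and a stop/start cycle after editing `ear.conf`.
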
## EAR Database Manager

The EAR Database Daemon (EARDBD) acts as an intermediate layer between any EAR component that inserts data and the EAR Database, in order to prevent the database server from collapsing under too many connections and insert queries.

The Database Manager caches records generated by the [EAR Library](#the-ear-library) and the [EARD](#ear-node-manager) in the system and reports them to the centralized database.
It is recommended to run several EARDBDs if the cluster is big enough, in order to reduce the number of inserts and connections to the database.

Also, the EARDBD accumulates data over a period of time to decrease the total number of insertions in the database, which helps the performance of big queries.
For now, only energy metrics can be accumulated, in a new metric called energy aggregation.
EARDBD uses the periodic power metrics sent by the EARD, the per-node daemon, including job identification details (Job ID and Step ID when executed in a SLURM system).

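
The accumulation idea described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not EARDBD's implementation: the type `energy_aggr_t`, the function `aggr_add` and the fixed-length window are hypothetical names and choices made for the example.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical aggregation record: one database row per window
 * instead of one row per sample. */
typedef struct {
    time_t start;                /* time of the first sample in the window */
    time_t end;                  /* time of the last sample in the window */
    unsigned long long energy_j; /* energy accumulated in the window */
    size_t samples;              /* number of samples merged */
} energy_aggr_t;

/* Adds one periodic sample to the current window. When the window has
 * lasted at least period_s seconds, the aggregation is copied to `out`,
 * the window is reset and 1 is returned (the caller would then insert
 * `out` in the database). Otherwise 0 is returned. */
int aggr_add(energy_aggr_t *a, time_t when, unsigned long energy_j,
             time_t period_s, energy_aggr_t *out)
{
    if (a->samples == 0)
        a->start = when;
    a->end = when;
    a->energy_j += energy_j;
    a->samples++;
    if (when - a->start < period_s)
        return 0;
    *out = *a;
    memset(a, 0, sizeof(*a));
    return 1;
}
```

With a 60 second window and samples arriving every 30 seconds, three samples collapse into a single insertion, which is the effect the accumulation is meant to achieve.
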
### Configuration

The EAR Database Daemon is configured through the `$(EAR_ETC)/ear/ear.conf` file. It can be dynamically reconfigured by reloading the service.

Please visit the [EAR configuration file page](Configuration#eardbd-configuration) for more information about the options of EARDBD and other components.

### Execution

To execute this component, these `systemctl` command examples are provided:

- `sudo systemctl start eardbd` to start the EARDBD service.
- `sudo systemctl stop eardbd` to stop the EARDBD service.
- `sudo systemctl reload eardbd` to force reloading the configuration of the EARDBD service.

## EAR Global Manager

The EAR Global Manager Daemon (EARGMD) is a cluster-wide component offering cluster energy monitoring and capping.
EARGM can work in two modes: manual and automatic.
When running in manual mode, EARGM monitors the total energy consumption, evaluates the percentage of energy consumption over the energy limit set by the admin and reports the cluster status to the DB. When running in automatic mode, apart from evaluating the energy consumption percentage, it sends the evaluation to the computing nodes. The EARDs pass these messages to EARL, which re-applies the energy policy with the new settings.

Apart from sending messages and reporting the energy consumption to the DB, EARGM offers additional ways to notify the energy consumption: commands can be executed automatically and mails can be sent automatically. Both the command to be executed and the mail address can be defined in `ear.conf`, where the energy limits, the monitoring period, etc. can also be specified.

EARGM uses periodic aggregated power metrics to efficiently compute the cluster energy consumption.
Aggregated metrics are computed by [EARDBD](#ear-database-manager) based on power metrics reported by [EARD](#ear-node-manager), the per-node daemon.

> __Note__: if you have multiple EARGMs running, only one should be used for energy management. To turn off energy management for a certain EARGM, simply set its energy value to 0.

### Power capping

EARGM also includes an optional power capping system. Power capping can work in two different ways:

- Cluster power cap (unlimited): Each EARGM controls the power consumption of the nodes under it by ensuring the global power does not exceed a set value. While the global power is under a percentage of the global value, the nodes run without any cap. If it approaches said value, a message is sent to all nodes to set their powercap to a pre-set value (via `max_powercap` in the tags section of `ear.conf`). Should the power go back to a value under the cap, a message is sent again so the nodes run at their default value (unlimited power).
- Fine grained power cap control: Each EARGM controls the power consumption of the nodes under it and redistributes a certain budget between the nodes, allocating more to the nodes that need it. It guarantees that any node has its default powercap allocation (defined by the `powercap` field in the tags section of `ear.conf`) if it is running an application.

Furthermore, when using fine grained power cap control it is possible to have multiple EARGMs, each controlling a part of the cluster, with (or without) meta-EARGMs redistributing the power allocation of each EARGM depending on the current needs of each part of the cluster. If no meta-EARGMs are specified, the power value each EARGM has will be static.

Meta-EARGMs are NOT compatible with the unlimited cluster powercap mode.

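
The cluster power cap (unlimited) mode described in this section behaves like a two-state controller. The sketch below illustrates that behaviour only; the names `pc_state_t`, `pc_msg_t` and `cluster_powercap_step` are hypothetical, and it assumes that a single trigger level (a percentage of the global cap) is used both to set and to release the cap, which the text does not specify.

```c
/* Hypothetical controller state: are the nodes currently capped? */
typedef enum { PC_UNLIMITED, PC_CAPPED } pc_state_t;

/* Hypothetical message that would be sent to all nodes. */
typedef enum { PC_MSG_NONE, PC_MSG_SET_CAP, PC_MSG_RELEASE_CAP } pc_msg_t;

/* One monitoring step. power_w is the measured global power, cap_w the
 * global limit and trigger_pct the percentage of cap_w at which the nodes
 * are told to apply their pre-set powercap (max_powercap). */
pc_msg_t cluster_powercap_step(pc_state_t *state, double power_w,
                               double cap_w, double trigger_pct)
{
    double trigger_w = cap_w * trigger_pct / 100.0;

    if (*state == PC_UNLIMITED && power_w >= trigger_w) {
        *state = PC_CAPPED;
        return PC_MSG_SET_CAP;      /* nodes apply max_powercap */
    }
    if (*state == PC_CAPPED && power_w < trigger_w) {
        *state = PC_UNLIMITED;
        return PC_MSG_RELEASE_CAP;  /* nodes go back to unlimited power */
    }
    return PC_MSG_NONE;             /* no change, nothing to send */
}
```

Keeping the state in the controller means a message is sent only when the situation changes, not on every monitoring period.
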
### Configuration

The EAR Global Manager is configured through the `$(EAR_ETC)/ear/ear.conf` file. It can be dynamically reconfigured by reloading the service.

Please visit the [EAR configuration file page](Configuration#EARGM-configuration) for more information about the options of EARGM and other components.

Additionally, two EARGMs can be used on the same host by declaring the environment variable `EARGMID` to specify which EARGM configuration each one should use. If this variable is not declared, all EARGMs on the same host will read the first entry.

### Execution

To execute this component, these `systemctl` command examples are provided:

- `sudo systemctl start eargmd` to start the EARGM service.
- `sudo systemctl stop eargmd` to stop the EARGM service.
- `sudo systemctl reload eargmd` to force reloading the configuration of the EARGM service.

## The EAR Library

The EAR Library (EARL) is the core of the EAR package.
The Library offers a lightweight and simple solution to select the optimal frequency for applications at runtime, with multiple power policies, each with a different approach to find that frequency.

EARL uses the [Daemon](#ear-node-manager) to read performance metrics and to send application data to the EAR Database.

### Overview

EARL is dynamically loaded next to the running applications by the [EAR Loader](EAR-Loader).
The Loader detects whether the application is MPI or not.
If it is MPI, the Loader also detects whether it is Intel MPI or OpenMPI and intercepts the MPI symbols through the PMPI interface. The next symbols are saved in order to provide compatibility with MPI or other profiling tools.
The Library is divided into several stages, summarized in the following picture:

![](./images/stack.png)

1. Automatic **detection** of application outer loops. This is done by intercepting MPI calls and invoking the Dynamic Application Iterative Structure detector algorithm. **DynAIS** is highly optimized for new Intel architectures, reporting low overhead. For non-MPI applications, EAR implements a time-guided approach.
2. Computation of the **application signature**. Once DynAIS starts reporting iterations for the outer loop, EAR starts to compute the application signature. This signature includes: iteration time, DC power consumption, bandwidth, cycles, instructions, etc. Since the error of the DC power measurements highly depends on the hardware, EAR automatically detects the hardware characteristics and sets a minimum time to compute the signature in order to minimize the average error.

![](./images/models.png)

The loop signature is used to **classify the application activity** in different phases. The current EAR version supports the following phases: IO bound, CPU computation and GPU idle, CPU busy waiting and GPU computing, CPU-GPU computation, and CPU computation (for CPU-only nodes). For phases including CPU computation, the optimization policy is applied. For other phases, the EAR Library implements some predefined CPU/Memory/GPU frequency settings.

3. **Power and performance projection**. EAR has its own performance and power models, which require the application and the system signatures as input. The system signature is a set of coefficients characterizing each node in the system. They are computed during the learning phase at the EAR configuration step. EAR projects the power used and the computing time (performance) of the running application for all the available frequencies in the system. These models are applied to CPU metrics and project CPU performance and power when varying the CPU frequency. Using these projections, the optimization policy can select the optimal CPU frequency.

![](./images/projections.png)

4. **Apply** the selected energy optimization policy. EAR includes two power policies to be selected at runtime: _minimize time to solution_ and _minimize energy to solution_, if permitted by the system administrator. At this point, EAR executes the power policy, using the projections computed in the previous phase, and selects the optimal frequency for an application and its particular run. An additional policy, _monitoring only_, can also be used; in this case the running frequency is not changed and only the application signature and metrics are computed and stored. The short version of the names is used when submitting jobs (`min_energy`, `min_time`, `monitoring`). Current policies already include memory frequency selection, but in this case it is not based on models; it is a guided search. Check whether memory frequency optimization is enabled by default in your installation. If the application is MPI, the policies also classify the processes as balanced or unbalanced. If they are unbalanced, a per-process CPU frequency is applied.

Some specific configurations are modified when jobs are executed sharing nodes with other jobs.
For example, the memory frequency optimization is disabled.
See the [environment variables page](EAR-environment-variables) for more information on how to tune the EAR Library optimization using environment variables.

### Configuration

The Library is configured through the `$(EAR_ETC)/ear.conf` file.
Please visit the [EAR configuration file page](Configuration#EARL-configuration) for more information about the options of EARL and other components.

EARL receives its specific settings through shared memory regions initialized by [EARD](#ear-node-manager).

### Usage

For information on how to run applications alongside EARL, read the [User guide](User-guide).
The next section contains more information regarding EAR's optimisation policies.

### Policies

EAR offers three energy policy plugins: `min_energy`, `min_time` and `monitoring`.
The last one is not a power policy; it is used just for application monitoring, where the CPU frequency is not modified (nor the memory or GPU frequency).
For application analysis, `monitoring` can be used with specific CPU, memory and/or GPU frequencies.

The energy policy is selected by setting the `--ear-policy=policy` option when submitting a SLURM job.
A policy parameter, which is a particular value or threshold depending on the policy, can be set using the flag `--ear-policy-th=value`.
Its default value is defined in the configuration file; check the [configuration page](Configuration) for more information.

#### Plugin `min_energy`

The goal of this policy is to minimise the energy consumed with a limit on the performance degradation. This limit is set with the SLURM `--ear-policy-th` option or in the configuration file. The `min_energy` policy selects the optimal frequency that minimizes energy while enforcing (performance degradation <= parameter). When executing with this policy, applications start at the default frequency (specified in `ear.conf`).

```
PerfDegr = (CurrTime - PrevTime) / (PrevTime)
```

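
The selection rule of `min_energy` can be sketched as follows. This is an illustration under stated assumptions, not EARL's code: `projection_t`, `perf_degradation` and `select_min_energy` are hypothetical names, energy is approximated as projected time multiplied by projected power, and `PrevTime` is taken to be the time at the reference (default) frequency.

```c
#include <stddef.h>

/* PerfDegr = (CurrTime - PrevTime) / PrevTime */
double perf_degradation(double prev_time, double curr_time)
{
    return (curr_time - prev_time) / prev_time;
}

/* Hypothetical projection for one candidate frequency. */
typedef struct {
    unsigned long freq_khz; /* candidate CPU frequency */
    double time_s;          /* projected iteration time */
    double power_w;         /* projected average power */
} projection_t;

/* Returns the index of the candidate with the lowest projected energy
 * (time * power) whose degradation with respect to ref_time_s does not
 * exceed max_degr. Returns default_idx when no candidate qualifies. */
size_t select_min_energy(const projection_t *proj, size_t n,
                         double ref_time_s, double max_degr,
                         size_t default_idx)
{
    size_t best = default_idx;
    double best_energy = -1.0; /* negative means "nothing selected yet" */

    for (size_t i = 0; i < n; i++) {
        if (perf_degradation(ref_time_s, proj[i].time_s) > max_degr)
            continue; /* too slow for the requested threshold */
        double energy = proj[i].time_s * proj[i].power_w;
        if (best_energy < 0.0 || energy < best_energy) {
            best_energy = energy;
            best = i;
        }
    }
    return best;
}
```

With a threshold of 0.05, a candidate that is 4% slower but consumes less energy is accepted, while one that is 15% slower is rejected even if its energy is lower.
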
#### Plugin `min_time`

The goal of this policy is to improve the execution time while guaranteeing a minimum ratio between performance benefit and frequency increment that justifies the increased energy consumption from this frequency increment. The policy uses the SLURM parameter option mentioned above as a minimum efficiency threshold.

**Example:** if `--ear-policy-th=0.75`, EAR will prevent scaling to higher frequencies if the ratio between performance gain and frequency gain is below 75% (that is, scaling continues only while PerfGain >= FreqGain \* threshold).

```
PerfGain=(PrevTime-CurrTime)/PrevTime
FreqGain=(CurFreq-PrevFreq)/PrevFreq
```

When launched with the `min_time` policy, applications start at a default frequency (defined in `ear.conf`).
Check the [configuration page](Configuration) for more information.

**Example:** given a system with a nominal frequency of 2.3GHz and the default P_STATE set to 3, an application executed with `min_time` will start with frequency `F[i]=2.0GHz` (3 P_STATEs less than nominal). When the application metrics are computed, the library computes the performance projection for `F[i+1]` and the performance gain, as shown in Figure 1. If the performance gain is greater than or equal to the threshold, the policy checks the next performance projection, `F[i+2]`. If the computed performance gain is less than the threshold, the policy selects the last frequency where the performance gain was enough, preventing the waste of energy.

![](./images/min_time_example.png)

Figure 1: `min_time` uses the threshold value as the minimum value for the performance gain between `F[i]` and `F[i+1]`.

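
The frequency walk of `min_time` can be sketched as follows. This is a simplified illustration, not EARL's implementation: `perf_gain`, `freq_gain` and `select_min_time` are hypothetical names, and the sketch assumes that the candidate frequencies are stored in ascending order together with their projected times.

```c
#include <stddef.h>

/* PerfGain = (PrevTime - CurrTime) / PrevTime */
double perf_gain(double prev_time, double curr_time)
{
    return (prev_time - curr_time) / prev_time;
}

/* FreqGain = (CurFreq - PrevFreq) / PrevFreq */
double freq_gain(double prev_freq, double curr_freq)
{
    return (curr_freq - prev_freq) / prev_freq;
}

/* freqs[] is sorted in ascending order and times[] holds the projected
 * time at each frequency. Starting at index `start` (the default
 * frequency), move to the next frequency while the step satisfies
 * PerfGain >= FreqGain * threshold. Returns the selected index. */
size_t select_min_time(const double *freqs, const double *times, size_t n,
                       size_t start, double threshold)
{
    size_t sel = start;

    while (sel + 1 < n) {
        double pg = perf_gain(times[sel], times[sel + 1]);
        double fg = freq_gain(freqs[sel], freqs[sel + 1]);
        if (pg < fg * threshold)
            break; /* the next step does not pay off: keep the last good one */
        sel++;
    }
    return sel;
}
```

For example, going from 2.0GHz to 2.1GHz is a 5% frequency gain, so with a threshold of 0.75 the projected time must improve by at least 3.75% for the step to be accepted.
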
### EAR API

EAR offers a user API for applications. The current EAR version offers functions to connect to and disconnect from EAR, to read the accumulated energy and time, to compute the difference between two measurements, and to set CPU and GPU frequencies.

- `int ear_connect()`
- `int ear_energy(unsigned long *energy_mj, unsigned long *time_ms)`
- `void ear_energy_diff(unsigned long ebegin, unsigned long eend, unsigned long *ediff, unsigned long tbegin, unsigned long tend, unsigned long *tdiff)`
- `int ear_set_cpufreq(cpu_set_t *mask, unsigned long cpufreq)`
- `int ear_set_gpufreq(int gpu_id, unsigned long gpufreq)`
- `int ear_set_gpufreq_list(int num_gpus, unsigned long *gpufreqlist)`
- `void ear_disconnect()`

EAR's header file and library can be found at `$EAR_INSTALL_PATH/include/ear.h` and `$EAR_INSTALL_PATH/lib/libEAR_api.so` respectively. The following example reports the energy, time, and average power during that time for a simple loop including a `sleep(5)`.

```
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <ear.h>

int main(int argc, char *argv[])
{
    unsigned long e_mj = 0, t_ms = 0;
    unsigned long e_mj_init = 0, t_ms_init = 0, e_mj_end = 0, t_ms_end = 0;
    time_t ts, ts_e;
    struct tm *tstamp;
    char s[128], s2[128];
    int i = 0;

    /* Connecting with ear */
    if (ear_connect() != EAR_SUCCESS)
    {
        printf("error connecting eard\n");
        exit(1);
    }

    /* Reading the initial energy and time */
    if (ear_energy(&e_mj_init, &t_ms_init) != EAR_SUCCESS)
    {
        printf("Error in ear_energy\n");
    }
    while (i < 5)
    {
        sleep(5);

        /* Reading the energy and time again after the sleep */
        if (ear_energy(&e_mj_end, &t_ms_end) != EAR_SUCCESS)
        {
            printf("Error in ear_energy\n");
        }
        else
        {
            /* Timestamps are in milliseconds: convert to seconds to print them */
            ts = (time_t)(t_ms_init / 1000);
            ts_e = (time_t)(t_ms_end / 1000);
            tstamp = localtime(&ts);
            strftime(s, sizeof(s), "%c", tstamp);
            tstamp = localtime(&ts_e);
            strftime(s2, sizeof(s2), "%c", tstamp);

            printf("Start time %s End time %s\n", s, s2);
            ear_energy_diff(e_mj_init, e_mj_end, &e_mj, t_ms_init, t_ms_end, &t_ms);
            /* Energy in mJ divided by time in ms gives the average power in W */
            printf("Time consumed %lu (ms), energy consumed %lu(mJ), "
                   "Avg power %lf(W)\n", t_ms, e_mj, (double)e_mj / (double)t_ms);
            /* The end of this interval is the start of the next one */
            e_mj_init = e_mj_end;
            t_ms_init = t_ms_end;
        }
        i++;
    }
    ear_disconnect();
    return 0;
}
```

## EAR Loader

The EAR Loader is responsible for loading the EAR Library.
It is a small and lightweight library loaded by the [EAR SLURM Plugin](#ear-slurm-plugin) (through the `LD_PRELOAD` environment variable) that identifies the user application and loads its corresponding EAR Library distribution.

The Loader detects the underlying application, identifying the MPI version (if used) and other minor details.
With this information, the Loader opens the suitable EAR Library version.

As can be read in the [EARL](#the-ear-library) section, depending on the MPI vendor the MPI types can be different, preventing any compatibility between distributions.
For example, if the MPI distribution is OpenMPI, the EAR Loader will load the EAR Library compiled with the OpenMPI includes.

You can read the [installation guide](Admin-guide#quick-installation-guide) for more information about compiling and installing different EARL versions.

## EAR SLURM plugin

The EAR SLURM plugin allows the EAR Library to be dynamically loaded and configured for SLURM jobs (and steps), if the flag `--ear=on` is set or if it is enabled by default.
Additionally, it reports any jobs that start or end to the nodes' EARDs for accounting and monitoring purposes.

### Configuration

Visit the [SLURM SPANK plugin section](Configuration#slurm-spank-plugin-configuration-file) on the configuration page to properly set up the SLURM `/etc/slurm/plugstack.conf` file.

You can find the complete list of parameters accepted by the EAR SLURM plugin in the [user guide](User-guide#ear-job-submission-flags).