|
|
## System power consumption and job accounting
|
|
|
|
|
|
This is the most basic feature.
|
|
|
|
|
|
EAR is able to collect node power consumption and report it periodically thanks to the [EAR Node Manager](#ear-node-manager) (EARD), a Linux service which runs on each compute node.
|
|
|
It is up to the sysadmin to decide how and where these periodic metrics are [reported](Configuration#eard-configuration).
|
|
|
The following figure shows this scheme.
|
|
|
|
|
|
## Application performance monitoring and energy efficiency optimization
|
|
|
|
|
|
Alongside applications running on compute nodes, a runtime library can be loaded dynamically (again thanks to batch scheduler support).
|
|
|
|
|
|
The [EAR Job Manager](#the-ear-library-job-manager) (EARL) runs within application/workflow processes, so it can collect performance metrics, which can be reported in the same way as the Node Manager's metrics, and is equally configurable.
|
|
|
Moreover, the Job Manager comes with optimization policies, which can select the optimal CPU/IMC/GPU frequencies based on those performance metrics by contacting the Node Manager.
|
|
|
The figure below shows the interaction between these two components.
|
|
|
|
|
|
|
|
|
# EAR Node Manager
|
|
|
|
|
|
|
|
|
The Node Manager (EARD) is a per-node Linux service that provides privileged metrics of each node as well as a periodic power monitoring service.
|
|
|
These periodic power metrics can be sent to EAR's database directly, via the EAR Database Daemon (EARDBD), or by using one of the provided [report plug-ins](Report).
|
|
|
|
|
|
See the [EARDBD](#ear-database-manager) section and the [configuration page](Configuration) for more information about the EAR Database Manager and how to configure the EARD to send its collected data to it.
|
|
|
|
|
|
## Overview
|
|
|
|
|
|
|
|
|
EARD is the component in charge of providing any service that requires privileged capabilities. The current version is conceived as an external process executed with root privileges.
|
|
|
|
|
|
|
|
|
It provides the following services, each one handled by a dedicated thread:
|
|
|
|
|
|
- Provides privileged metrics to EARL, such as the average frequency, uncore integrated memory controller counters used to compute the memory bandwidth, and energy metrics (DC node, DRAM and package energy).
|
|
|
- Implements a periodic power monitoring service. This service allows the EAR package to track the total energy consumed in the system.
|
|
|
For information on how to run applications alongside EARL, read the [User guide](User-guide).
|
|
|
The next section contains more information about EAR's optimization policies.
|
|
|
|
|
|
## Classification
|
|
|
|
|
|
In EARL's pipeline, classification is a step that optimizes the power and performance projections of the energy models. The idea behind it is that, upon identifying the type of activity of an application, we can adapt its execution according to the architectural resources it exploits.
|
|
|
|
|
|
As explained in the [EAR Library](Architecture#overview-2) section, EAR accounts for different execution phases, which can be separated into **CPU computation** and **non-CPU computation**. Bear in mind that, for now, CPU computation phases only take into account the activity of the CPU, not that of the GPU. Since the optimization policy is applied to these execution phases, and it depends on the projections made by the energy models, the classification optimizes these projections by indicating whether they should try increasing or decreasing P-states.
|
|
|
|
|
|
In EAR, CPU computation phases include
|
|
|
- **_CPU-bound_ phases**, which are characterized by intensive usage of the CPU for calculus-related operations (normally measured through the GFLOPS and CPI)
|
|
|
- **_MEMORY-bound_ phases**, characterized by the intensity of calls to (main) memory (normally measured through MEM_GBS and TPI)
|
|
|
- **_MIX_ phases**, which are in-between these two
|
|
|
|
|
|
Given the complexity of correctly distinguishing between these execution phases, classification requires an adequate strategy. Thus, in this section we present the strategies proposed and implemented in EARL.
|
|
|
|
|
|
### Default model
|
|
|
|
|
|
EAR's default classification model applies a fixed set of ranges to the CPI and MEM_GBS metrics. These ranges, defined according to the architecture's characteristics via expert knowledge, allow identifying the different execution phases at a fundamental level.
|
|
|
|
|
|
#### Input
|
|
|
|
|
|
This strategy, available since EAR's installation, takes 4 thresholds: 2 for delimiting the CPI and memory bandwidth (i.e., MEM_GBS) of CPU-bound applications, and 2 more for delimiting those of MEMORY-bound ones. For instance, the values proposed for [*Sapphire Rapids*](https://www.intel.com/content/www/us/en/products/sku/231746/intel-xeon-platinum-8480-processor-105m-cache-2-00-ghz/specifications.html) nodes are
|
|
|
- CPI of CPU-bound apps: 0.4
|
|
|
- MEM_GBS of CPU-bound apps: 180
|
|
|
- CPI of MEMORY-bound apps: 0.4
|
|
|
- MEM_GBS of MEMORY-bound apps: 250
|
|
|
|
|
|
#### Classification philosophy
|
|
|
|
|
|
With these thresholds defined, the classification proceeds as follows
|
|
|
|
|
|
```
|
|
|
Let S be the last registered signature
|
|
|
Let CPU be the struct with the CPU-bound-related thresholds
|
|
|
Let MEM be the struct with the MEMORY-bound-related thresholds
|
|
|
IF (S->CPI <= CPU->CPI && S->MEM_GBS <= CPU->MEM_GBS)
|
|
|
Mark app as CPU-bound
|
|
|
ELSE IF (S->CPI >= MEM->CPI && S->MEM_GBS >= MEM->MEM_GBS)
|
|
|
Mark app as MEMORY-bound
|
|
|
ELSE
|
|
|
Mark app as MIX
|
|
|
```
|
|
|
|
|
|
Let us go over the cases considered by the strategy:
|
|
|
1. To begin with, the model checks whether the application is CPU-bound by checking if its CPI and MEM_GBS are _below_ the CPU thresholds. The direction of this comparison reflects that we expect a CPU-bound application to be executing many instructions (thus having a low CPI) and not bound by memory (thus registering a low memory bandwidth).
|
|
|
2. If the strategy finds that the app is not CPU-bound, it checks whether the considered metrics are _above_ the MEM thresholds. The direction of this comparison reflects that we expect a MEMORY-bound app to be using a considerable amount of memory bandwidth (thus having a high MEM_GBS) and not executing too many instructions (thus having a high CPI as well).
|
|
|
3. If none of these conditions are met, we label the app as MIX.
|
|
|
|
|
|
The strength of this approach is its speed and effectiveness, given that the classification is based on the performance typically expected from the considered execution phases.
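As an illustration, the decision rule above can be sketched in Python. This is a hedged sketch, not EAR's implementation: the threshold values are the Sapphire Rapids examples from the Input section, and the `Signature` structure is a simplification of EAR's real signature.

```python
from dataclasses import dataclass

@dataclass
class Signature:
    cpi: float      # cycles per instruction
    mem_gbs: float  # memory bandwidth (GB/s)

# Thresholds proposed for Sapphire Rapids nodes (see Input above)
CPU_BOUND = Signature(cpi=0.4, mem_gbs=180.0)
MEM_BOUND = Signature(cpi=0.4, mem_gbs=250.0)

def classify(s: Signature) -> str:
    """Label the last registered signature as CPU-bound, MEMORY-bound or MIX."""
    if s.cpi <= CPU_BOUND.cpi and s.mem_gbs <= CPU_BOUND.mem_gbs:
        return "CPU-bound"       # low CPI, low bandwidth
    if s.cpi >= MEM_BOUND.cpi and s.mem_gbs >= MEM_BOUND.mem_gbs:
        return "MEMORY-bound"    # high CPI, high bandwidth
    return "MIX"

print(classify(Signature(cpi=0.3, mem_gbs=50.0)))    # CPU-bound
print(classify(Signature(cpi=0.9, mem_gbs=300.0)))   # MEMORY-bound
```

Any signature that satisfies neither condition, e.g. a moderate CPI with moderate bandwidth, falls through to the MIX label.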
|
|
|
|
|
|
### Roofline model
|
|
|
The roofline model combines floating point performance, memory traffic and operational intensity to characterize the activity of an application based on the performance limitations of the hardware <A href="#roofline">[2]</A>. For EAR, it allows identifying execution phase types in a simple and quick way at runtime.
|
|
|
|
|
|
#### Input
|
|
|
To use this strategy in EAR, we need the floating point performance and memory traffic peaks, which can be computed either theoretically or empirically.
|
|
|
|
|
|
##### Theoretical roofline
|
|
|
To get the theoretical peaks, we propose using equations <A href="#peak-flops">(1)</A> and <A href="#peak-gbs">(2)</A> as follows:
|
|
|
|
|
|
![equations.png](images/equations.png)
|
|
|
|
|
|
**NOTE**: it is usually assumed that:
|
|
|
|
|
|
```
|
|
|
- #FMA units= 2
|
|
|
- bytes/cycle=8
|
|
|
- flops/cycle=16
|
|
|
```
|
|
|
|
|
|
To give an idea of how to apply these equations, let us give an example on how we would apply them on [EPYC 9654](https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9654.html) nodes:
|
|
|
|
|
|
![examples.png](images/examples.png)
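As a hedged sketch of these equations, the snippet below reproduces the peak values used for the EPYC 9654 example later in this section, under the per-cycle assumptions noted above (2 FMA units, 16 flops/cycle, 8 bytes/cycle). The 3.7 GHz frequency and the dual-socket core and channel counts are illustrative assumptions, not values stated in this document.

```python
def peak_gflops(cores, freq_ghz, fma_units=2, flops_per_cycle=16):
    # Equation (1): theoretical floating point performance peak
    return cores * freq_ghz * fma_units * flops_per_cycle

def peak_mem_gbs(channels, transfer_gts, bytes_per_cycle=8):
    # Equation (2): theoretical memory traffic peak
    return channels * transfer_gts * bytes_per_cycle

# Assumed dual-socket EPYC 9654 node: 2 x 96 cores at an assumed
# 3.7 GHz, and 2 x 12 DDR5-4800 memory channels.
print(round(peak_gflops(cores=192, freq_ghz=3.7), 1))        # 22732.8
print(round(peak_mem_gbs(channels=24, transfer_gts=4.8), 1)) # 921.6
```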
|
|
|
|
|
|
##### Empirical roofline
|
|
|
To get the empirical peaks, it is suggested to use the STREAM benchmark [[1]](#stream) and the HPL benchmark [[5]](#hpl) to obtain the memory bandwidth and floating point performance peaks, respectively.
|
|
|
|
|
|
---
|
|
|
|
|
|
Once the peaks have been properly computed, we store them in a file of the form `roofline.{tag}.data`, where `tag` corresponds to the tag of the node partition, which will depend on the environment the user is working on (check [Tags](Configuration#tags) for more information). The format to store the peaks is
|
|
|
```shell
|
|
|
{peak memory bandwidth} {peak floating point performance}
|
|
|
```
|
|
|
In summary, only a plain text file is required, named `roofline.{tag}.data` and containing both peaks in the order shown. Moreover, this file is expected to be stored in the `$EAR_ETC/ear/coeffs` directory; check the [EARL configuration section](Configuration#earl-configuration) for more details.
|
|
|
|
|
|
Following the EPYC 9654 example, we would create a file called `roofline.epyc9654.data` containing
|
|
|
```shell
|
|
|
921.6 22732.8
|
|
|
```
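As a minimal sketch of how such a file can be consumed, the snippet below writes the example file and reads the two ordered peaks back. This is an illustration of the file format only, not EAR's actual loading code; error handling is omitted.

```python
from pathlib import Path

# Write the example peaks: memory bandwidth first, then floating
# point performance, as the format above specifies.
Path("roofline.epyc9654.data").write_text("921.6 22732.8\n")

# Read them back: split the single line into the two ordered peaks.
peak_mem_gbs, peak_gflops = map(
    float, Path("roofline.epyc9654.data").read_text().split()
)
print(peak_mem_gbs, peak_gflops)  # 921.6 22732.8
```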
|
|
|
|
|
|
#### Classification strategy
|
|
|
|
|
|
Once EAR has access to the roofline peaks, the classification proceeds as follows
|
|
|
```
|
|
|
Let S be the last registered signature
|
|
|
Let PEAK be the struct with the peaks of the architecture
|
|
|
IF (S->GFLOPS / S->MEM_GBS >= PEAK->GFLOPS / PEAK->MEM_GBS)
|
|
|
Mark app as CPU-bound
|
|
|
ELSE IF (S->MEM_GBS >= PEAK->MEM_GBS * 0.75)
|
|
|
Mark app as MEMORY-bound
|
|
|
ELSE
|
|
|
Mark app as MIX
|
|
|
```
|
|
|
|
|
|
Let us go over it step by step:
|
|
|
1. We begin by checking whether the app is _CPU-bound_, which is equivalent to checking whether its operational intensity reaches the threshold defined by the peaks of the architecture.
|
|
|
2. If the app is not _CPU-bound_, we check whether its memory bandwidth is close enough (in absolute units) to the peak of the architecture. To do so, we define a “similarity threshold” (in our example, 0.75). If it is close enough, the app is labelled as _MEMORY-bound_.
|
|
|
3. If the app is neither of these, it is labelled as _MIX_.
|
|
|
|
|
|
This way, we adapt the original roofline model to include the _MIX_ execution phase while maintaining its classification philosophy.
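The adapted roofline rule can be sketched in Python as follows. This is a hedged sketch: the peak values are the EPYC 9654 example peaks used in this section, and 0.75 is the similarity threshold from the pseudocode.

```python
# Example peaks, in the order stored on disk: bandwidth first.
PEAK_MEM_GBS, PEAK_GFLOPS = 921.6, 22732.8
SIMILARITY = 0.75  # "similarity threshold" for the MEMORY-bound check

def classify(gflops: float, mem_gbs: float) -> str:
    # Operational intensity at or above the machine balance => CPU-bound
    if gflops / mem_gbs >= PEAK_GFLOPS / PEAK_MEM_GBS:
        return "CPU-bound"
    # Bandwidth close enough to the architecture's peak => MEMORY-bound
    if mem_gbs >= PEAK_MEM_GBS * SIMILARITY:
        return "MEMORY-bound"
    return "MIX"

print(classify(gflops=20000.0, mem_gbs=400.0))  # CPU-bound
print(classify(gflops=500.0, mem_gbs=800.0))    # MEMORY-bound
print(classify(gflops=500.0, mem_gbs=200.0))    # MIX
```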
|
|
|
|
|
|
### K-medoids
|
|
|
K-medoids is a clustering method based on k-means <A href="#k-means">[6]</A>, but instead of using _centroids_ as class representatives, it uses _medoids_ (i.e., elements of the dataset) <A href="#k-medoids">[7]</A>. Using this strategy is encouraged once EAR has access to enough user data to train such a model. It is also flexible enough to be regenerated over time, thus accounting for changes in the type of applications executed by users as well as becoming more robust over time.
|
|
|
|
|
|
Furthermore, to achieve a balance between classification quality and runtime performance, the model used by EAR defines medoids as subsets of signature metrics composed of CPI, TPI, GFLOPS and MEM_GBS. This way, it combines metrics from both the architecture and the application's computational activity.
|
|
|
|
|
|
Finally, it is worth noting that, to ensure that the classification is properly conducted, the data is previously standardized <A href="#standardize">[8]</A>.
|
|
|
|
|
|
#### Input
|
|
|
|
|
|
To use this strategy, two plain text files are needed: one for the standardization, and another for the medoids. The first stores the means and standard deviations in the following format:
|
|
|
```shell
|
|
|
{CPI std} {CPI mean} {TPI std} {TPI mean} {GFLOPS std} {GFLOPS mean} {MEM_GBS std} {MEM_GBS mean}
|
|
|
```
|
|
|
This first file must be named `extremes.{tag}.data`, where `tag` corresponds to the tag of the node partition.
|
|
|
|
|
|
The second file stores the medoids in the following format:
|
|
|
```shell
|
|
|
{CPU-bound CPI} {CPU-bound TPI} {CPU-bound GFLOPS} {CPU-bound MEM_GBS} {MEMORY-bound CPI} {MEMORY-bound TPI} {MEMORY-bound GFLOPS} {MEMORY-bound MEM_GBS} {MIX CPI} {MIX TPI} {MIX GFLOPS} {MIX MEM_GBS}
|
|
|
```
|
|
|
This file must be named `medoids.{tag}.data`, where `tag` corresponds to the tag of the node partition.
|
|
|
|
|
|
**NOTE**: the `tag` will depend on the environment the user is working on. Check [Tags](Configuration#tags) for more information on this.
|
|
|
|
|
|
**NOTE 2**: as in the _roofline_ case, it is expected that both `medoids` and `extremes` files are stored in the `$EAR_ETC/ear/coeffs` directory. Check the [EARL configuration section](Configuration#earl-configuration) for more details on this.
|
|
|
|
|
|
#### Classification strategy
|
|
|
The classification proceeds as follows
|
|
|
```
|
|
|
Let S be the last signature registered by EAR
|
|
|
Let V := [S->CPI, S->TPI, S->GFLOPS, S->MEM_GBS]
|
|
|
Standardize vector V
|
|
|
Check Euclidean distance from V to the medoids and label signature accordingly
|
|
|
```
|
|
|
|
|
|
Let us go over the steps followed by this pseudocode:
|
|
|
1. For each loop signature registered, we standardize its CPI, TPI, GFLOPS and MEM_GBS. To do so, EAR preloads the extremes (i.e., mu and sigma of each metric) at the beginning of the execution.
|
|
|
2. Once processed, we check the Euclidean distance of this vector to the different medoids, which are also preloaded (just like extremes).
|
|
|
3. With these distances computed, we identify the current execution phase of the app by checking which one is closer to the signature.
|
|
|
|
|
|
With these steps, EAR's classification becomes more flexible and equally robust.
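The steps above can be sketched as follows. This is a hedged sketch only: the extremes and medoid values are made-up placeholders, since real deployments load them from the `extremes.{tag}.data` and `medoids.{tag}.data` files described above.

```python
import math

# (std, mean) per metric, in CPI, TPI, GFLOPS, MEM_GBS order.
# Placeholder values: real ones come from extremes.{tag}.data.
EXTREMES = [(0.5, 1.0), (2.0, 4.0), (500.0, 800.0), (80.0, 120.0)]

# Placeholder medoids in the same metric order, one per phase.
MEDOIDS = {
    "CPU-bound":    [0.3, 2.0, 1500.0, 60.0],
    "MEMORY-bound": [1.2, 8.0, 300.0, 260.0],
    "MIX":          [0.8, 5.0, 700.0, 150.0],
}

def standardize(v):
    """Apply (x - mean) / std to each metric using the preloaded extremes."""
    return [(x - mean) / std for x, (std, mean) in zip(v, EXTREMES)]

def classify(signature):
    """signature = [CPI, TPI, GFLOPS, MEM_GBS]; return the closest medoid's label."""
    v = standardize(signature)
    def dist(label):
        return math.dist(v, standardize(MEDOIDS[label]))  # Euclidean distance
    return min(MEDOIDS, key=dist)

print(classify([0.35, 2.1, 1400.0, 70.0]))  # CPU-bound
```

Note that the sketch standardizes the medoids with the same extremes so that distances are computed in the standardized space.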
|
|
|
|
|
|
---
|
|
|
|
|
|
As a final note, the library chooses the classification strategy in the following way:
|
|
|
```
|
|
|
IF medoids are available
|
|
|
Load K-medoids model
|
|
|
ELSE IF roofline is available
|
|
|
Load roofline model
|
|
|
ELSE
|
|
|
Load default model
|
|
|
```
|
|
|
Thus, the user must be aware not only of the availability of each strategy, but also of how the library prioritizes some models over others.
|
|
|
|
|
|
### References
|
|
|
|
|
|
<a name="stream">[1]</a> McCalpin, John. “STREAM: Sustainable memory bandwidth in high performance computers.” _http://www.cs.virginia.edu/stream/_ (2006).
|
|
|
|
|
|
<a name="roofline">[2]</a> Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures. _Communications of the ACM_, _52_(4), 65-76.
|
|
|
|
|
|
<a name="peak-flops">[3]</a> Andreolli, C., Thierry, P., Borges, L., Skinner, G., Yount, C., Jeffers, J., & Reinders, J. (2015). Characterization and optimization methodology applied to stencil computations. _High Performance Parallelism Pearls_, 377-396.
|
|
|
|
|
|
<a name="peak-gbs">[4]</a> Bakos, J. D. (2016). Multicore and data-level optimization. In Embedded Systems (pp. 49–103). Elsevier. https://doi.org/10.1016/b978-0-12-800342-8.00002-x
|
|
|
|
|
|
<a name="hpl">[5]</a> Petitet, Antoine. “HPL-a portable implementation of the high-performance Linpack benchmark for distributed-memory computers.” _http://www.netlib.org/benchmark/hpl/_ (2004).
|
|
|
|
|
|
<a name="k-means">[6]</a> Blömer, J., Lammersen, C., Schmidt, M., & Sohler, C. (2016). Theoretical analysis of the k-means algorithm–a survey. _Algorithm Engineering: Selected Results and Surveys_, 81-116.
|
|
|
|
|
|
<a name="k-medoids">[7]</a> Park, H. S., & Jun, C. H. (2009). A simple and fast algorithm for K-medoids clustering. _Expert systems with applications_, _36_(2), 3336-3341.
|
|
|
|
|
|
<a name="standardize">[8]</a> Gal, M. S., & Rubinfeld, D. L. (2019). Data standardization. _NYUL Rev._, _94_, 737.
|
|
|
|
|
|
## Policies
|
|
|
|
|
|
EAR offers three energy policy plugins: `min_energy`, `min_time` and `monitoring`.
|
|
|
You can find the complete list of parameters accepted by the EAR SLURM plugin in the [user guide](User-guide#ear-job-submission-flags).
|
|
|
|
|
|
# EAR Data Center Monitor
|
|
|
|
|
|
It is a new EAR service for Data Center monitoring.
|
|
|
In particular, it targets elements other than compute nodes, which are already monitored by the EARD instances running on them.
|
|
|
It has a [dedicated section](EDCMON) you can read for more information.
|
|
|
|
|
|
|
|
|
# EAR application API
|
|
|
|