# Architecture

v5.2 release authored Oct 23, 2025 by Oriol
Furthermore, when using fine-grained power cap control it is possible to have multiple EARGMs.
Meta-EARGMs are NOT compatible with the unlimited cluster powercap mode.
### Local powercap
EARGM has a local version that can be run without privileges and that controls the power consumption of a list of nodes. This can be used as a rudimentary form of job powercap, where a job with N nodes is not allowed to consume more than a certain amount of power. In the current version, if the allocated power is exceeded, a powercap is applied to all nodes equally (that is, the same amount of power is allocated to each node, regardless of its actual consumption).
Furthermore, custom scripts may be executed when the power reaches certain thresholds, giving the user more control over what to do.
See the execution section for how to run this mode.
## Configuration
The EAR Global Manager is configured through the `$(EAR_ETC)/ear/ear.conf` file. It can be reconfigured dynamically by reloading the service.
To execute this component, these `systemctl` command examples are provided:
- `sudo systemctl stop eargmd` to stop the EARGM service.
- `sudo systemctl reload eargmd` to force reloading the configuration of the EARGM service.
To execute a local EARGM with powercap for certain nodes, one may run it as:
```
eargmd --powercap=2000 --nodes=node[0-4] --powercap-policy soft --suspend-perc 90 --suspend-action suspend_action.sh --powercap-period=10 --conf-path=$HOME/ear_install/etc/ear/ear.conf
```
This will execute an EARGM controlling nodes node[0-4], applying a total powercap of 2000 W with a soft powercap policy (that is, applications run as normal unless the aggregated power of all 5 nodes reaches 2000 W, at which point a power limit of 400 W per node is applied).
`suspend-perc` indicates the percentage of the powercap at which `suspend-action` is executed; in this case, when power reaches 1800 W, `suspend_action.sh` is called _once_. A reciprocal pair, `resume-perc` and `resume-action`, also exists: `resume-action` is called only once `resume-perc` power has been reached *AND* `suspend-action` has already been called.
Finally, `powercap-period` sets the time between polls for power from the nodes (how often the EARGM checks the current power consumption), and `conf-path` specifies a custom ear.conf file.
For more information, one can run `eargmd --help`.
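For illustration only (this is not EAR source code), the interaction of these flags for the example above can be sketched as follows; the function name and return shape are invented for the sketch:

```python
# Sketch of the local EARGM flag semantics: a 2000 W cap over 5 nodes under
# the "soft" policy, with suspend-perc at 90 % (suspend_action.sh at 1800 W).

def plan_powercap(total_cap_w, nodes, current_power_w, suspend_perc=90):
    """Return (per-node cap, suspend threshold, actions to take right now)."""
    per_node_cap = total_cap_w / len(nodes)          # equal split, regardless of per-node use
    suspend_threshold = total_cap_w * suspend_perc / 100
    actions = []
    if current_power_w >= suspend_threshold:
        actions.append("run suspend-action script")  # called once per threshold crossing
    if current_power_w >= total_cap_w:
        actions.append(f"limit every node to {per_node_cap:.0f} W")
    return per_node_cap, suspend_threshold, actions

nodes = [f"node{i}" for i in range(5)]
cap, threshold, actions = plan_powercap(2000, nodes, current_power_w=1850)
# cap == 400.0, threshold == 1800.0, actions == ["run suspend-action script"]
```

At 1850 W only the suspend script fires; the 400 W per-node limit is applied only once the aggregated power reaches the 2000 W cap.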
# The EAR Library (Job Manager)
The loop signature is used to **classify the application activity** in different execution phases.
Some specific configurations are modified when jobs share nodes with other jobs.
For example, memory frequency optimization is disabled.
See section [environment variables page](https://gitlab.bsc.es/ear_team/ear_private/-/wikis/EAR-environment-variables) for more information on how to tune the EAR library optimization using environment variables.
## Configuration
EARL receives its specific settings through a shared memory region initialized by the EARD.
## Usage
For information on how to run applications alongside with EARL read the [User guide](https://gitlab.bsc.es/ear_team/ear_private/-/wikis/User-guide).
The next section contains more information about EAR's optimization policies.
## Classification
In the context of the Library's pipeline, phase classification is the module that, given the last computed application signature, undertakes the task of identifying the type of activity of the application, thereby giving hints to optimize its execution. By approaching the application signature as a semantic expression of this activity, the classification allows for guiding (or even skipping, if possible) subsequent steps of the pipeline.
In the EARL's pipeline, classification is a step that optimizes the power and performance projections of the energy models. The idea behind it is that, upon identifying the type of activity of an application, we can adapt its execution according to the architectural resources it exploits.
As explained in the [EAR Library](Architecture#overview-2) section, EAR accounts for different execution phases, which can be separated into **CPU computation** and **non-CPU computation**. Bear in mind that, for now, CPU computation phases only take into account the activity of the CPU, not that of the GPU. Since the optimization policy is applied to these execution phases, and it depends on the projections made by the energy models, the classification refines these projections by indicating whether they should try increasing or decreasing P-states.
In EAR, CPU computation phases include
- **_CPU-bound_ phases**, which are characterized by intensive usage of the CPU for calculus-related operations (normally measured through the GFLOPS and CPI)
- **_MEMORY-bound_ phases**, characterized by the intensity of calls to (main) memory (normally measured through MEM_GBS and TPI)
- **_MIX_ phases**, which are in-between these two
Given the complexity of correctly distinguishing between these execution phases, classification requires an adequate strategy. Thus, in this section we present the strategies proposed and implemented in the EARL.
### Default model
EAR's default classification model is based on applying a fixed set of ranges to the CPI and MEM_GBS metrics. These ranges, defined according to the architecture's characteristics via expert knowledge, allow identifying the different execution phases on a fundamental level.
#### Input
This strategy, available since EAR's installation, takes 4 thresholds: 2 delimiting the CPI and memory bandwidth (i.e., MEM_GBS) of CPU-bound applications, and 2 more delimiting those of MEMORY-bound ones. For instance, the values proposed for [*Sapphire Rapids*](https://www.intel.com/content/www/us/en/products/sku/231746/intel-xeon-platinum-8480-processor-105m-cache-2-00-ghz/specifications.html) nodes are:
- CPI of CPU-bound apps: 0.4
- MEM_GBS of CPU-bound apps: 180
- CPI of MEMORY-bound apps: 0.4
- MEM_GBS of MEMORY-bound apps: 250
#### Classification philosophy
With these thresholds defined, the classification proceeds as follows
```
Let S be the last registered signature
Let CPU be the struct with the CPU-bound-related thresholds
Let MEM be the struct with the MEMORY-bound-related thresholds

IF (S->CPI <= CPU->CPI && S->MEM_GBS <= CPU->MEM_GBS)
    Mark app as CPU-bound
ELSE IF (S->CPI >= MEM->CPI && S->MEM_GBS >= MEM->MEM_GBS)
    Mark app as MEMORY-bound
ELSE
    Mark app as MIX
```
Let us go over the cases considered by the strategy:
1. To begin with, the model checks if the application is CPU-bound by checking if the CPI and MEM_GBS of the application are _below_ the CPU thresholds. The sign of this comparison is due to the fact that we expect a CPU-bound application to be executing lots of instructions (thus having a small CPI) and not bounded by memory (thus registering a low memory bandwidth).
2. If the strategy finds that the app is not CPU-bound, it checks whether the considered metrics are _above_ the MEM thresholds. The sign of this comparison is due to the fact that we expect a MEMORY-bound app to be using a considerable amount of memory bandwidth (thus having a big MEM_GBS) and not executing too many instructions (thus having a big CPI as well).
3. If none of these conditions are met, we label the app as MIX.
The strength of this approach is its speed and effectiveness, given that the classification is based upon the performance typically expected from the execution phases considered.
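The procedure above can be sketched in a few lines of Python (an illustrative sketch, not EAR source), using the Sapphire Rapids thresholds listed earlier:

```python
# Default threshold-based classifier from the pseudocode above.
# Thresholds are the example Sapphire Rapids values from this page.
CPU = {"CPI": 0.4, "MEM_GBS": 180}   # CPU-bound thresholds
MEM = {"CPI": 0.4, "MEM_GBS": 250}   # MEMORY-bound thresholds

def classify(cpi, mem_gbs):
    if cpi <= CPU["CPI"] and mem_gbs <= CPU["MEM_GBS"]:
        return "CPU-bound"
    if cpi >= MEM["CPI"] and mem_gbs >= MEM["MEM_GBS"]:
        return "MEMORY-bound"
    return "MIX"

print(classify(0.3, 100))   # low CPI, low bandwidth   -> CPU-bound
print(classify(0.9, 300))   # high CPI, high bandwidth -> MEMORY-bound
print(classify(0.5, 200))   # in between               -> MIX
```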
### Roofline model
The roofline model combines floating point performance, memory traffic and operational intensity to characterize the activity of an application based on the performance limitations of the hardware [[2]](#roofline). For EAR, it allows for identifying execution phase types in a simple and quick way at runtime.
#### Input
To use this strategy in EAR, we need the floating point performance and memory traffic peaks, which can be computed either theoretically or empirically.
##### Theoretical roofline
To get the theoretical peaks, we propose using equations (1) [[3]](#peak-flops) and (2) [[4]](#peak-gbs) as follows:
![equations.png](images/equations.png)
**NOTE**: it is usually assumed that:
```
- #FMA units   = 2
- bytes/cycle  = 8
- flops/cycle  = 16
```
To give an idea of how to apply these equations, let us give an example on how we would apply them on [EPYC 9654](https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9654.html) nodes:
![examples.png](images/examples.png)
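As a numerical cross-check of that example (a sketch, not taken from the page's images), the peaks for a dual-socket EPYC 9654 node can be reproduced as below; the socket count, 3.7 GHz boost frequency and 12 DDR5-4800 channels per socket are assumptions about that platform:

```python
# Theoretical peaks for an assumed dual-socket EPYC 9654 node, using the
# values above (#FMA units = 2, flops/cycle = 16) and DDR5-4800 memory.
sockets, cores, freq_ghz = 2, 96, 3.7
fma_units, flops_per_cycle = 2, 16
channels, transfer_gts, bytes_per_transfer = 12, 4.8, 8

peak_gflops = sockets * cores * freq_ghz * fma_units * flops_per_cycle
peak_gbs = sockets * channels * transfer_gts * bytes_per_transfer

print(round(peak_gflops, 1), round(peak_gbs, 1))  # 22732.8 921.6
```

These two numbers match the `roofline.epyc9654.data` contents shown below.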
##### Empirical roofline
To get the empirical peaks, it is suggested to use the STREAM benchmark [[1]](#stream) and the HPL benchmark [[5]](#hpl) for harvesting the memory bandwidth and floating point performance peaks, respectively.
---
Once the peaks have been properly computed, we store them in a file of the form `roofline.{tag}.data`, where `tag` corresponds to the tag of the node partition, which will depend on the environment the user is working on (check [Tags](Configuration#tags) for more information). The format to store the peaks is
```shell
{peak memory bandwidth} {peak floating performance}
```
In summary, only a plain text file is required, named `roofline.{tag}.data` and containing both peaks in order. Moreover, this file is expected to be stored in the `$EAR_ETC/ear/coeffs` directory; check the [EARL configuration section](Configuration#earl-configuration) for more details.
Following the EPYC 9654 example, we would create a file called `roofline.epyc9654.data` containing
```shell
921.6 22732.8
```
#### Classification strategy
Once EAR has access to the roofline peaks, the classification proceeds as follows
```
Let S be the last registered signature
Let PEAK be the struct with the peaks of the architecture

IF (S->GFLOPS / S->MEM_GBS >= PEAK->GFLOPS / PEAK->MEM_GBS)
    Mark app as CPU-bound
ELSE IF (S->MEM_GBS >= PEAK->MEM_GBS * 0.75)
    Mark app as MEMORY-bound
ELSE
    Mark app as MIX
```
Let us go over it step by step:
1. We begin by checking if the app is _CPU-bound_, which is equivalent to checking if the operational intensity is bigger than the threshold defined by the peaks of the architecture.
2. If the app is not _CPU-bound_, we check if the app has a memory bandwidth close enough (in absolute units) to the peak of the architecture. To do so, we measure it by defining a “similarity threshold” (in our example, 0.75). If it is close enough, the app is labelled as _MEMORY-bound_.
3. If the app is neither of them, then it is labelled as _MIX_.
This way, we adapt the original roofline model to include the _MIX_ execution phase while maintaining its classification philosophy.
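A minimal sketch of this strategy (illustrative, not EAR source), reusing the `roofline.{tag}.data` format and the EPYC 9654 peaks from the example above; the signature values are invented for illustration:

```python
# Roofline-based classifier from the pseudocode above, loading the peaks
# from the documented file format: {peak memory bandwidth} {peak floating performance}.
SIMILARITY = 0.75  # "close enough to peak bandwidth" threshold from the example

def load_peaks(text):
    mem_gbs, gflops = (float(x) for x in text.split())
    return {"MEM_GBS": mem_gbs, "GFLOPS": gflops}

def classify(signature, peak):
    # Operational intensity at or above the machine balance point -> CPU-bound
    if signature["GFLOPS"] / signature["MEM_GBS"] >= peak["GFLOPS"] / peak["MEM_GBS"]:
        return "CPU-bound"
    if signature["MEM_GBS"] >= peak["MEM_GBS"] * SIMILARITY:
        return "MEMORY-bound"
    return "MIX"

peak = load_peaks("921.6 22732.8")  # the EPYC 9654 example file contents
print(classify({"GFLOPS": 20000.0, "MEM_GBS": 500.0}, peak))  # CPU-bound
print(classify({"GFLOPS": 100.0, "MEM_GBS": 800.0}, peak))    # MEMORY-bound
print(classify({"GFLOPS": 100.0, "MEM_GBS": 300.0}, peak))    # MIX
```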
### K-medoids
K-medoids is a clustering method based on k-means [[6]](#k-means), but instead of using _centroids_ as representatives of the classes, it uses _medoids_ (i.e., elements of the dataset) [[7]](#k-medoids). This strategy's usage is encouraged once EAR has access to enough user data to train such a model. It is also flexible enough to be regenerated over time, thus accounting for changes in the type of applications executed by users as well as becoming more robust over time.
Furthermore, to achieve a balance between classification quality and runtime performance, the model used by EAR defines medoids as subsets of signature metrics composed of CPI, TPI, GFLOPS and MEM_GBS. This way, it combines metrics both from the architecture and the application's computational activity.
Finally, it is worth noting that, to ensure the classification is properly conducted, the data is first standardized [[8]](#standardize).
#### Input
To use this strategy, two plain text files are needed: one for the standardization, and another one for the medoids. Regarding the first one, it stores the mean and standard deviations in the following format:
```shell
{CPI std} {CPI mean} {TPI std} {TPI mean} {GFLOPS std} {GFLOPS mean} {MEM_GBS std} {MEM_GBS mean}
```
This first file must be named `extremes.{tag}.data`, where `tag` corresponds to the tag of the node partition.
Now, regarding the second file, it stores the medoids in the following format:
```shell
{CPU-bound CPI} {CPU-bound TPI} {CPU-bound GFLOPS} {CPU-bound MEM_GBS} {MEMORY-bound CPI} {MEMORY-bound TPI} {MEMORY-bound GFLOPS} {MEMORY-bound MEM_GBS} {MIX CPI} {MIX TPI} {MIX GFLOPS} {MIX MEM_GBS}
```
This file must be named `medoids.{tag}.data`, where `tag` corresponds to the tag of the node partition.
**NOTE**: the `tag` will depend on the environment the user is working on. Check [Tags](Configuration#tags) for more information on this.
**NOTE 2**: as in the _roofline_ case, it is expected that both `medoids` and `extremes` files are stored in the `$EAR_ETC/ear/coeffs` directory. Check the [EARL configuration section](Configuration#earl-configuration) for more details on this.
#### Classification strategy
The classification proceeds as follows
```
Let S be the last signature registered by EAR
Let V := [S->CPI, S->TPI, S->GFLOPS, S->MEM_GBS]
Standardize vector V
Check Euclidean distance from V to the medoids and label signature accordingly
```
Let us go over the steps followed by this pseudocode:
1. For each loop signature registered, we standardize its CPI, TPI, GFLOPS and MEM_GBS. To do so, EAR preloads the extremes (i.e., mu and sigma of each metric) at the beginning of the execution.
2. Once processed, we check the Euclidean distance of this vector to the different medoids, which are also preloaded (just like extremes).
3. With these distances computed, we identify the current execution phase of the app by checking which one is closer to the signature.
With these steps, EAR's classification becomes more flexible and equally robust.
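The steps above can be sketched as follows (an illustrative Python sketch, not EAR source). The numeric values are invented, and the medoids are assumed, for illustration, to be stored already in standardized space:

```python
import math

# K-medoids classification step, parsing the extremes and medoids file
# formats documented above. Metric order: CPI, TPI, GFLOPS, MEM_GBS.

def parse_extremes(text):
    # {CPI std} {CPI mean} {TPI std} {TPI mean} {GFLOPS std} {GFLOPS mean} ...
    vals = [float(x) for x in text.split()]
    return [(vals[i], vals[i + 1]) for i in range(0, len(vals), 2)]  # (std, mean)

def parse_medoids(text):
    # {CPU-bound CPI} ... {MEMORY-bound CPI} ... {MIX CPI} ... (4 metrics each)
    vals = [float(x) for x in text.split()]
    labels = ["CPU-bound", "MEMORY-bound", "MIX"]
    return {labels[i]: vals[4 * i:4 * i + 4] for i in range(3)}

def classify(signature, extremes, medoids):
    # Standardize the signature, then pick the closest medoid (Euclidean distance).
    z = [(v - mean) / std for v, (std, mean) in zip(signature, extremes)]
    return min(medoids, key=lambda label: math.dist(z, medoids[label]))

extremes = parse_extremes("0.2 0.6 1.0 2.0 50.0 100.0 40.0 120.0")  # invented values
medoids = parse_medoids("-1 -1 1 -1   1 1 -1 1   0 0 0 0")          # invented values
print(classify([0.4, 1.0, 150.0, 80.0], extremes, medoids))  # CPU-bound
```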
---
As a final note, the library chooses the classification strategy in the following way:
```
IF medoids are available
    Load K-medoids model
ELSE IF roofline is available
    Load roofline model
ELSE
    Load default model
```
Thus, the user must be aware not only of the availability of each strategy, but also how the library prioritizes some models over others.
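Assuming the file names and location documented above (the `tag` value below is illustrative), this selection order can be sketched as:

```python
import os

# Strategy selection: prefer k-medoids, fall back to roofline, then default,
# based on which data files exist under $EAR_ETC/ear/coeffs.

def pick_strategy(coeffs_dir, tag):
    if os.path.exists(os.path.join(coeffs_dir, f"medoids.{tag}.data")):
        return "k-medoids"
    if os.path.exists(os.path.join(coeffs_dir, f"roofline.{tag}.data")):
        return "roofline"
    return "default"

print(pick_strategy("/nonexistent/coeffs", "epyc9654"))  # no files -> default
```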
To recap, EAR takes the type of activity identified as what we call an *execution phase*, and currently accounts for the following types:
- **Computational phases**, which focus on the activity of the CPU in a way that is useful for the Job Manager's pipeline. This type includes execution phases such as:
  - <ins>CPU-bound</ins>: intensity of calculus-related operations, as measured by the Cycles per Instruction (or `CPI`) and floating-point operations (or `GFLOPS`).
  - <ins>Memory-bound</ins>: intensity of accesses to (main) memory (as measured by `MEM_GBS` in GB/s) and the memory Transactions per Instruction (or `TPI`).
  - <ins>Mix</ins>: intensity distributed between both calculus-related operations and accesses to main memory.
- **Non-computational phases**, which focus on the activity of the CPU in a way that allows for applying pre-defined optimizations to the application. This type includes the following execution phases:
  - <ins>CPU busy wait</ins>: intensive usage of the CPU due to an active wait.
  - <ins>IO-bound</ins>: intensive usage of input-output channels.
  - <ins>MPI-bound</ins>: presence of lots of MPI calls.

Because correctly identifying computational phases can optimize the Library's pipeline, and because making solid distinctions between execution phases is a complex task, the classification strategy is a key element of the main action loop. In summary, EAR incorporates three different strategies:
- **Default strategy**: EAR's default classification model is based on setting predefined ranges of values for the `CPI` and `MEM_GBS` metrics. These ranges, defined according to the architecture's characteristics via expert knowledge, allow identifying the different execution phases on a fundamental level, and are available since EAR's installation.
- **Roofline strategy**: this approach is based on the roofline model, which conducts bottleneck analysis of the architecture's peak floating point performance and memory traffic to characterize the activity of any application. This strategy becomes available once the peaks for both resources have been computed, and allows for identifying execution phase types in a simple and quick way at runtime.
- **K-medoids strategy**: this approach is based on the classification offered by the k-medoids clustering method, originally derived from k-means. The strategy, which becomes available once EAR has enough data in the database, allows for a more flexible classification than that of previous strategies, while also allowing for regeneration over time, as needed.

### References

<a name="stream">[1]</a> McCalpin, John. "STREAM: Sustainable memory bandwidth in high performance computers." _http://www.cs.virginia.edu/stream/_ (2006).

<a name="roofline">[2]</a> Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures. _Communications of the ACM_, _52_(4), 65-76.

<a name="peak-flops">[3]</a> Andreolli, C., Thierry, P., Borges, L., Skinner, G., Yount, C., Jeffers, J., & Reinders, J. (2015). Characterization and optimization methodology applied to stencil computations. _High Performance Parallelism Pearls_, 377-396.

<a name="peak-gbs">[4]</a> Bakos, J. D. (2016). Multicore and data-level optimization. In _Embedded Systems_ (pp. 49-103). Elsevier. https://doi.org/10.1016/b978-0-12-800342-8.00002-x

<a name="hpl">[5]</a> Petitet, Antoine. "HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers." _http://www.netlib.org/benchmark/hpl/_ (2004).

<a name="k-means">[6]</a> Blömer, J., Lammersen, C., Schmidt, M., & Sohler, C. (2016). Theoretical analysis of the k-means algorithm - a survey. _Algorithm Engineering: Selected Results and Surveys_, 81-116.

<a name="k-medoids">[7]</a> Park, H. S., & Jun, C. H. (2009). A simple and fast algorithm for K-medoids clustering. _Expert Systems with Applications_, _36_(2), 3336-3341.

<a name="standardize">[8]</a> Gal, M. S., & Rubinfeld, D. L. (2019). Data standardization. _NYUL Rev._, _94_, 737.
## Policies
It is a small and lightweight library loaded by the [EAR SLURM Plugin](#ear-slurm-plugin).
The Loader detects the underlying application, identifying the MPI version (if used) and other minor details.
With this information, the loader opens the suitable EAR Library version.
As can be read in the [EARL](#the-ear-library-job-manager) page, depending on the MPI vendor the MPI types can be different, preventing any compatibility between distributions.
For example, if the MPI distribution is OpenMPI, the EAR Loader will load the EAR Library compiled with the OpenMPI includes.
You can read the [installation guide](Admin-guide#quick-installation-guide) for more information about compiling and installing different EARL versions.
Additionally, it reports any jobs that start or end to the nodes' EARDs for accounting.
## Configuration
Visit the [SLURM SPANK plugin section](Configuration#slurm-spank-plug-in-configuration-file) on the configuration page to properly set up the SLURM `/etc/slurm/plugstack.conf` file.
You can find the complete list of EAR SLURM plugin accepted parameters in the
[user guide](https://gitlab.bsc.es/ear_team/ear/-/wikis/User-guide#ear-job-submission-flags).