|
|
|
|
|
|
|
As laid out in the [Classification](Architecture#classification) section of the Library, EAR incorporates three different strategies. This page explains how they work internally, and how and when they become available to the end user.
|
|
|
|
|
|
|
|
## Default model
|
|
|
|
|
|
|
|
EAR's default classification model is based on the application of a fixed set of thresholds (in absolute units) to the `CPI` and `MEM_GBS` metrics. These thresholds, available from the moment EAR is installed, allow identifying the different execution phases at a fundamental level.
|
|
|
|
|
|
|
|
### Input
|
|
|
|
|
|
|
|
This strategy takes 4 thresholds: 2 for characterizing CPU-bound applications and 2 for MEMORY-bound ones. These are defined according to the architecture's specifications through the administrators' domain knowledge. For instance, the values proposed for a [*Sapphire Rapids*](https://www.intel.com/content/www/us/en/products/sku/231746/intel-xeon-platinum-8480-processor-105m-cache-2-00-ghz/specifications.html) type of node are:
|
|
|
|
- CPU-bound `CPI`: 0.4
|
|
|
|
- CPU-bound `MEM_GBS`: 180
|
|
|
|
- MEMORY-bound `CPI`: 0.4
|
|
|
|
- MEMORY-bound `MEM_GBS`: 250
|
|
|
|
|
|
|
|
### Classification philosophy
|
|
|
|
|
|
|
|
With these thresholds defined, the classification proceeds as follows:
|
|
|
|
|
|
|
|
```
|
|
|
|
Let S be the last registered signature
|
|
|
|
Let <CPU_CPI, CPU_BWD> be the tuple of CPU-bound thresholds
|
|
|
|
Let <MEM_CPI, MEM_BWD> be the tuple of MEMORY-bound thresholds
|
|
|
|
IF (S->CPI <= CPU_CPI && S->MEM_GBS <= CPU_BWD)
|
|
|
|
Mark app as CPU-bound
|
|
|
|
ELSE IF (S->CPI >= MEM_CPI && S->MEM_GBS >= MEM_BWD)
|
|
|
|
Mark app as MEMORY-bound
|
|
|
|
ELSE
|
|
|
|
Mark app as MIX
|
|
|
|
```
|
|
|
|
|
|
|
|
Let us go over the idiosyncrasies of this strategy:
|
|
|
|
1. To begin with, the model checks whether the `CPI` and `MEM_GBS` of the application are _below_ the CPU-bound thresholds. The direction of this comparison comes from the fact that we expect a CPU-bound application to be executing lots of instructions (thus having a small CPI) and not to be limited by memory (thus registering a low memory bandwidth).
|
|
|
|
2. If the app is not CPU-bound, the strategy checks whether the considered metrics are _above_ the MEMORY-bound thresholds. The direction of this comparison comes from the fact that we expect a MEMORY-bound app to be using a considerable amount of memory bandwidth (thus having a high MEM_GBS) and not executing too many instructions per cycle (thus having a high CPI as well).
|
|
|
|
3. If neither of these conditions is met, the strategy labels the app as MIX.
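To make this walkthrough concrete, here is a minimal C sketch of the thresholding logic described above. The `signature_t` and `phase_t` types and the function name are hypothetical and used for illustration only (they do not correspond to EAR's internal API); the threshold values are the Sapphire Rapids ones proposed earlier.

```c
#include <stdio.h>

/* Hypothetical types and names, for illustration only. */
typedef enum { PHASE_CPU_BOUND, PHASE_MEM_BOUND, PHASE_MIX } phase_t;
typedef struct { double cpi; double mem_gbs; } signature_t;

/* Thresholds proposed above for Sapphire Rapids nodes. */
static const double CPU_CPI = 0.4, CPU_BWD = 180.0;
static const double MEM_CPI = 0.4, MEM_BWD = 250.0;

static phase_t classify_default(const signature_t *s)
{
    if (s->cpi <= CPU_CPI && s->mem_gbs <= CPU_BWD)
        return PHASE_CPU_BOUND;   /* many instructions per cycle, low bandwidth */
    if (s->cpi >= MEM_CPI && s->mem_gbs >= MEM_BWD)
        return PHASE_MEM_BOUND;   /* high bandwidth, few instructions per cycle */
    return PHASE_MIX;
}

int main(void)
{
    signature_t s = { .cpi = 0.3, .mem_gbs = 50.0 };  /* low CPI, low MEM_GBS */
    printf("phase = %d\n", classify_default(&s));     /* 0 -> CPU-bound       */
    return 0;
}
```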
|
|
|
|
|
|
|
|
The strength of this approach is its speed and effectiveness, since the classification is based on the performance typically expected from the execution phases considered. However, the strategy is limited by its rigidity: the fact that (1) it works with a small number of thresholds, and (2) these are expressed in absolute values, leads to a mostly rigid (although not incorrect) classification of the applications' activity.
|
|
|
|
|
|
|
|
## Roofline model
|
|
|
|
|
|
|
|
The roofline model combines floating point performance and memory traffic to characterize the activity of an application based on the performance limitations of the hardware <A href="#roofline">[2]</A>. For EAR, it allows identifying execution phase types simply and quickly at runtime via bottleneck analysis.
|
|
|
|
|
|
|
|
### Input
|
|
|
|
|
|
|
|
To use this strategy in EAR, we need the floating point performance and memory traffic peaks, which can be computed either theoretically or empirically (e.g., with benchmarks such as STREAM <A href="#stream">[1]</A> or HPL <A href="#hpl">[5]</A>). The EAR team offers its own tool, called [ear-roofline](https://gitlab.bsc.es/ear_team/ear-roofline), which computes the theoretical peaks automatically according to the architecture's specifications. To do so, it makes use of the following equations:
|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|
To give an idea of how to apply these equations, let us work through an example on [EPYC 9654](https://www.amd.com/en/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9654.html) nodes:
|
|
|
|
|
|
|
|

|
|
|
|
|
|
|
|
Once the peaks have been properly computed, they are stored in a file named `roofline.{tag}.data`, where `tag` corresponds to the tag of the targeted architecture as defined in the `ear.conf` file, which will depend on the environment the user is working in (check [Tags](Configuration#tags) for more information). The format of the stored peaks is
|
|
|
|
```shell
|
|
|
|
{peak memory bandwidth} {peak floating point performance}
|
|
|
|
```
|
|
|
|
|
|
|
|
Following the EPYC 9654 example, we would create a file called `roofline.epyc9654.data` containing
|
|
|
|
```shell
|
|
|
|
921.6 22732.8
|
|
|
|
```
|
|
|
|
|
|
|
|
Moreover, this file is expected to be stored in the `$EAR_ETC/ear/coeffs/island{X}` directory; check the [EARL configuration](Configuration#earl-configuration) and the [Island description](Configuration#island-description) sections for more details on this. There exists a way to circumvent this requirement, however: as indicated in the [Paths modification](EAR-research-envvars#paths-modification) section, there's an environment variable called `HACK_ROOFLINE_COEFF_PATH` which allows hacking the path to the coefficients file. Note that **this variable only hacks the path, and not the expected filename**, whose format still has to be complied with.
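To make the expected file contents concrete, below is a minimal C sketch of how such a file could be parsed. The function name and error handling are illustrative only and do not correspond to EAR's actual loader:

```c
#include <stdio.h>

/* Reads "{peak memory bandwidth} {peak floating point performance}"
 * from a roofline.{tag}.data file. Illustrative only; not EAR's loader. */
static int read_roofline_peaks(const char *path,
                               double *peak_gbs, double *peak_gflops)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int fields = fscanf(f, "%lf %lf", peak_gbs, peak_gflops);
    fclose(f);
    return (fields == 2) ? 0 : -1;
}

int main(void)
{
    double gbs, gflops;
    /* The full path would normally be built from $EAR_ETC (or the hacked path). */
    if (read_roofline_peaks("roofline.epyc9654.data", &gbs, &gflops) == 0)
        printf("peak bandwidth = %.1f GB/s, peak performance = %.1f GFLOPS\n",
               gbs, gflops);
    return 0;
}
```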
|
|
|
|
|
|
|
|
### Classification strategy
|
|
|
|
|
|
|
|
Once EAR has access to the roofline peaks, the classification proceeds as follows:
|
|
|
|
```
|
|
|
|
Let S be the last registered signature
|
|
|
|
Let <PEAK_FLOPS, PEAK_GBS> be the tuple of peaks of the architecture
|
|
|
|
IF (S->GFLOPS / S->MEM_GBS >= PEAK_FLOPS / PEAK_GBS)
|
|
|
|
Mark app as CPU-bound
|
|
|
|
ELSE IF (S->MEM_GBS >= PEAK_GBS * 0.75)
|
|
|
|
Mark app as MEMORY-bound
|
|
|
|
ELSE
|
|
|
|
Mark app as MIX
|
|
|
|
```
|
|
|
|
|
|
|
|
Let us go over it step by step:
|
|
|
|
1. The strategy begins by checking whether the operational intensity (i.e., the ratio between floating point operations and bytes transferred) is greater than or equal to the threshold defined by the peaks of the architecture (`PEAK_FLOPS / PEAK_GBS`). If so, the app is marked as _CPU-bound_.
|
|
|
|
2. If the app is not _CPU-bound_, the strategy checks whether its memory bandwidth is close enough (in absolute units) to the peak memory bandwidth of the architecture via a threshold factor (which, by default, is experimentally set to 0.75). If the application meets this condition, it is labelled as _MEMORY-bound_.
|
|
|
|
3. If neither condition is met, the app is labelled as _MIX_.
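A minimal C sketch of this decision rule follows, mirroring the pseudocode above; the types and function name are hypothetical, used for illustration only:

```c
#include <stdio.h>

/* Hypothetical types, for illustration only. */
typedef enum { PHASE_CPU_BOUND, PHASE_MEM_BOUND, PHASE_MIX } phase_t;
typedef struct { double gflops; double mem_gbs; } signature_t;

static phase_t classify_roofline(const signature_t *s,
                                 double peak_gflops, double peak_gbs)
{
    /* 1. Operational intensity vs. the ratio of the architecture's peaks. */
    if (s->gflops / s->mem_gbs >= peak_gflops / peak_gbs)
        return PHASE_CPU_BOUND;
    /* 2. Memory bandwidth close to the peak (default factor 0.75). */
    if (s->mem_gbs >= peak_gbs * 0.75)
        return PHASE_MEM_BOUND;
    /* 3. Otherwise, MIX. */
    return PHASE_MIX;
}

int main(void)
{
    signature_t s = { .gflops = 500.0, .mem_gbs = 800.0 };         /* bandwidth-heavy   */
    printf("phase = %d\n", classify_roofline(&s, 22732.8, 921.6)); /* 1 -> MEMORY-bound */
    return 0;
}
```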
|
|
|
|
|
|
|
|
This way, the original roofline model is adapted to include the _MIX_ execution phase while maintaining its classification philosophy, thus allowing execution phases to be identified with a speed and effectiveness similar to those of the default strategy. Although this strategy is more robust (given that it is based on bottleneck analysis), it is virtually just as rigid as the previous one.
|
|
|
|
|
|
|
|
## K-medoids
|
|
|
|
K-medoids is a clustering method based on the k-means <A href="#k-means">[6]</A> algorithm, but instead of using _centroids_ as representatives of the classes, it uses _medoids_ (i.e., elements of the dataset) <A href="#k-medoids">[7]</A>. This strategy, which becomes available once EAR has enough user data in its database to train it, is flexible enough to be regenerated over time, thus accounting for changes in the type of applications executed by users while ensuring its robustness.
|
|
|
|
|
|
|
|
### Input
|
|
|
|
|
|
|
|
To achieve a balance between classification quality and runtime performance, this strategy works with a subset of signature metrics, composed of `CPI`, `TPI`, `GFLOPS` and `MEM_GBS`. This way, it combines metrics from both the architecture and the application's computational activity.
|
|
|
|
|
|
|
|
To use this strategy, two files are needed: one for the standardization process and another one for the medoids. The first one stores the mean and standard deviation of each metric in the following format:
|
|
|
|
```shell
|
|
|
|
{CPI std} {CPI mean} {TPI std} {TPI mean} {GFLOPS std} {GFLOPS mean} {MEM_GBS std} {MEM_GBS mean}
|
|
|
|
```
|
|
|
|
This first file must be named `extremes.{tag}.data`, where `tag` corresponds to the tag of the node partition.
|
|
|
|
|
|
|
|
The second file stores the medoids in the following format:
|
|
|
|
```shell
|
|
|
|
{CPU-bound CPI} {CPU-bound TPI} {CPU-bound GFLOPS} {CPU-bound MEM_GBS} {MEMORY-bound CPI} {MEMORY-bound TPI} {MEMORY-bound GFLOPS} {MEMORY-bound MEM_GBS} {MIX CPI} {MIX TPI} {MIX GFLOPS} {MIX MEM_GBS}
|
|
|
|
```
|
|
|
|
This file must be named `medoids.{tag}.data`, where `tag` again corresponds to the tag of the node partition.
|
|
|
|
|
|
|
|
**NOTE**: the `tag` will depend on the environment the user is working on. Check [Tags](Configuration#tags) for more information on this.
|
|
|
|
|
|
|
|
**NOTE 2**: as in the _roofline_ case, both the `medoids` and `extremes` files are expected to be stored in the `${EAR_ETC}/ear/coeffs/island{X}` directory. Check the [EARL configuration](Configuration#earl-configuration) and the [Island description](Configuration#island-description) sections for more details on this. Additionally, like in the previous case, there exists an environment variable called `HACK_MEDOIDS_COEFF_PATH` which, as described in the [Paths modification](EAR-research-envvars#paths-modification) section, allows hacking the path to the coefficients.
|
|
|
|
|
|
|
|
Finally, just as with the roofline strategy, the EAR team offers its own tool, called ear-kmedoids (<ins>still in development</ins>), which automatically computes and generates both files in the format required by EAR, in a user-friendly manner.
|
|
|
|
|
|
|
|
### Classification strategy
|
|
|
|
|
|
|
|
Once EAR has access to the `extremes` and `medoids` files, the classification proceeds as follows:
|
|
|
|
|
|
|
|
```
|
|
|
|
Let S be the last registered signature
|
|
|
|
Let <CPU_MEDOID, MEM_MEDOID, MIX_MEDOID> be the tuple of medoids
|
|
|
|
Let V = [S->CPI, S->TPI, S->GFLOPS, S->MEM_GBS]
|
|
|
|
Standardize vector V
|
|
|
|
Let D1 = euclidean_distance(V, CPU_MEDOID)
|
|
|
|
Let D2 = euclidean_distance(V, MEM_MEDOID)
|
|
|
|
Let D3 = euclidean_distance(V, MIX_MEDOID)
|
|
|
|
IF (D1 < D2 and D1 < D3)
|
|
|
|
Mark app as CPU-bound
|
|
|
|
ELSE IF (D2 < D1 and D2 < D3)
|
|
|
|
Mark app as MEMORY-bound
|
|
|
|
ELSE
|
|
|
|
Mark app as MIX
|
|
|
|
```
|
|
|
|
|
|
|
|
Let us go over these steps:
|
|
|
|
1. The strategy begins by selecting the targeted subset of signature metrics and standardizing them. This step implicitly involves EAR preloading the extremes (i.e., the mean and standard deviation of each metric) during the initialization of the Library, so this loading is done just once per execution.
|
|
|
|
2. Once the metrics have been standardized, we compute the Euclidean distance from this vector to each of the medoids, which are also preloaded during the initialization of the Library (just like the extremes). We name these distances D1, D2 and D3.
|
|
|
|
3. With these distances computed, we identify the current execution phase of the app by checking which medoid is closest to the signature.
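The following is a minimal C sketch of these three steps. The function names and the hard-coded values are hypothetical placeholders; in EAR, the means, standard deviations and medoids come from the `extremes.{tag}.data` and `medoids.{tag}.data` files described above (the medoids are assumed to live in the same standardized space as the signature vector).

```c
#include <stdio.h>

#define N_METRICS 4   /* CPI, TPI, GFLOPS, MEM_GBS    */
#define N_PHASES  3   /* CPU-bound, MEMORY-bound, MIX */

/* Squared Euclidean distance; enough to decide which medoid is closest. */
static double sq_dist(const double a[N_METRICS], const double b[N_METRICS])
{
    double d = 0.0;
    for (int i = 0; i < N_METRICS; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

/* v: raw signature metrics; mean/std: values from extremes.{tag}.data;
 * medoids: values from medoids.{tag}.data.
 * Returns 0 = CPU-bound, 1 = MEMORY-bound, 2 = MIX. */
static int classify_kmedoids(double v[N_METRICS],
                             const double mean[N_METRICS],
                             const double std[N_METRICS],
                             const double medoids[N_PHASES][N_METRICS])
{
    /* Step 1: standardize the signature vector. */
    for (int i = 0; i < N_METRICS; i++)
        v[i] = (v[i] - mean[i]) / std[i];

    /* Steps 2-3: compute the distances and pick the closest medoid,
     * following the tie-breaking of the pseudocode above. */
    double d_cpu = sq_dist(v, medoids[0]);
    double d_mem = sq_dist(v, medoids[1]);
    double d_mix = sq_dist(v, medoids[2]);
    if (d_cpu < d_mem && d_cpu < d_mix) return 0;
    if (d_mem < d_cpu && d_mem < d_mix) return 1;
    return 2;
}

int main(void)
{
    /* Placeholder extremes and medoids (illustrative values only). */
    const double mean[N_METRICS] = { 1.0, 8.0, 100.0, 100.0 };
    const double std [N_METRICS] = { 0.5, 4.0,  80.0,  80.0 };
    const double medoids[N_PHASES][N_METRICS] = {
        { -1.0, -1.0,  1.5, -1.0 },   /* CPU-bound    */
        {  1.5,  1.5, -1.0,  1.5 },   /* MEMORY-bound */
        {  0.0,  0.0,  0.0,  0.0 },   /* MIX          */
    };
    double v[N_METRICS] = { 0.4, 4.0, 250.0, 40.0 };  /* low CPI, high GFLOPS */
    printf("phase = %d\n", classify_kmedoids(v, mean, std, medoids));  /* 0 */
    return 0;
}
```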
|
|
|
|
|
|
|
|
With this strategy, EAR's classification becomes more flexible and adaptive while maintaining its robustness, which helps attain the maximum performance expected from this module of the Library.
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
As a final note, the Library selects the classification strategy by prioritizing *K-medoids* over *Roofline*, and the latter over the *Default* strategy. It can be thought of in the following way:
|
|
|
|
```
|
|
|
|
IF K-medoids strategy is available:
|
|
|
|
Load K-medoids strategy
|
|
|
|
ELSE IF Roofline is available:
|
|
|
|
Load roofline strategy
|
|
|
|
ELSE
|
|
|
|
Load default strategy
|
|
|
|
```
|
|
|
|
|
|
|
|
Thus, the user must be aware of which files are available and accessible to EAR, since these determine how EAR approaches signature classification.
|
|
|
|
|
|
|
|
## References
|
|
|
|
|
|
|
|
<a name="stream">[1]</a> McCalpin, John. “STREAM: Sustainable memory bandwidth in high performance computers.” _http://www.cs.virginia.edu/stream/_ (2006).
|
|
|
|
|
|
|
|
<a name="roofline">[2]</a> Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures. _Communications of the ACM_, _52_(4), 65-76.
|
|
|
|
|
|
|
|
<a name="peak-flops">[3]</a> Andreolli, C., Thierry, P., Borges, L., Skinner, G., Yount, C., Jeffers, J., & Reinders, J. (2015). Characterization and optimization methodology applied to stencil computations. _High Performance Parallelism Pearls_, 377-396.
|
|
|
|
|
|
|
|
<a name="peak-gbs">[4]</a> Bakos, J. D. (2016). Multicore and data-level optimization. In Embedded Systems (pp. 49–103). Elsevier. https://doi.org/10.1016/b978-0-12-800342-8.00002-x
|
|
|
|
|
|
|
|
<a name="hpl">[5]</a> Petitet, Antoine. “HPL-a portable implementation of the high-performance Linpack benchmark for distributed-memory computers.” _http://www.netlib.org/benchmark/hpl/_ (2004).
|
|
|
|
|
|
|
|
<a name="k-means">[6]</a> Blömer, J., Lammersen, C., Schmidt, M., & Sohler, C. (2016). Theoretical analysis of the k-means algorithm–a survey. _Algorithm Engineering: Selected Results and Surveys_, 81-116.
|
|
|
|
|
|
|
|
<a name="k-medoids">[7]</a> Park, H. S., & Jun, C. H. (2009). A simple and fast algorithm for K-medoids clustering. _Expert systems with applications_, _36_(2), 3336-3341.