| ... | ... | @@ -2,14 +2,14 @@ |
|
|
|
|
|
|
|
### Application information
|
|
|
|
|
|
|
|
The following tables contain information directly related to applications executed on the system while EAR was monitoring. The main key is the JOBID.STEPID combination generated by the scheduler.
|
|
|
|
The following tables contain information directly related to applications executed on the system while EAR was monitoring. The main key is the JOBID.STEPID.APPID combination generated by the scheduler and EAR.
|
|
|
|
|
|
|
|
- **Jobs**: job information (app_id, user_id, job_id, step_id, etc). One record per JOBID.STEPID is created in the DB.
|
|
|
|
- **Applications**: this table's records serve as a link between Jobs and Signatures, providing an application signature (from EARL) for each node of a job. One record per JOBID.STEPID.NODENAME is created in the DB.
|
|
|
|
- **Jobs**: job information (app_name, user_name, job_id, step_id, app_id, etc). One record per JOBID.STEPID.APPID is created in the DB.
|
|
|
|
- **Applications**: this table's records serve as a link between Jobs and Signatures, providing an application signature (from EARL) for each node of a job. One record per JOBID.STEPID.APPID.NODENAME is created in the DB.
|
|
|
|
- **Loops**: similar to _Applications_, but stores a Signature for each application loop detected by EARL, instead of one per each application. This table provides internal details of running applications and could significantly increase the DB size.
|
|
|
|
- **Signatures**: EARL computed signature and metrics. One record per JOBID.STEPID.NODENAME is created in the DB when the application is executed with EARL.
|
|
|
|
- **Signatures**: EARL computed signature and metrics. One record per JOBID.STEPID.APPID.NODENAME is created in the DB when the application is executed with EARL.
|
|
|
|
- **GPU_signatures**: EARL computed GPU signatures. This information belongs to a loop or application signature. If the signature is from a node with 4 GPUs there will be 4 records.
|
|
|
|
- **Power_signatures**: Basic time and power metrics that can be obtained without EARL. Reported for all applications. One record per JOBID.STEPID.NODENAME is created in the DB.
|
|
|
|
- **Power_signatures**: Basic time and power metrics that can be obtained without EARL. Reported for all applications. One record per JOBID.STEPID.APPID.NODENAME is created in the DB.
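
As an illustration of how these tables relate, the following sketch joins _Jobs_, _Applications_ and _Signatures_ through the JOBID.STEPID.APPID key. It assumes a MySQL/MariaDB database that already uses the EAR 7.0 column names; the job and step ids are placeholder values, not taken from a real system.

```sql
-- Sketch: per-node EARL signature metrics for one job step.
-- 12345 and 0 are placeholder job and step ids.
SELECT j.job_id, j.step_id, j.app_id, j.app_name,
       a.node_name, s.node_power, s.elapsed_time
FROM Jobs AS j
JOIN Applications AS a
  ON a.job_id = j.job_id
 AND a.step_id = j.step_id
 AND a.app_id = j.app_id
JOIN Signatures AS s
  ON s.id = a.earl_signature_id
WHERE j.job_id = 12345
  AND j.step_id = 0;
```

Because this uses an inner join on _Signatures_, nodes where the application ran without EARL (and therefore have no EARL signature) do not appear in the result.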
|
|
|
|
|
|
|
|
### System monitoring
|
|
|
|
|
| ... | ... | @@ -49,7 +49,7 @@ For more information on this commands, check the [commands' page on the wiki](ht |
|
|
|
When running `edb_create` some tables might not be created, or may have some quirks, depending on some `ear.conf` settings. The settings and alterations are as follows:
|
|
|
|
|
|
|
|
- `DBReportNodeDetail`: if set to 1, `edb_create` will create two additional columns in the _Periodic_metrics_ table for Temperature (in Celsius) and Frequency (in Hz) accounting.
|
|
|
|
- `DBReportSigDetail`: if set to 1, _Signatures_ will have additional fields for cycles, instructions, and FLOPS1-8 counters (number of instruction by type).
|
|
|
|
- `DBReportSigDetail`: if set to 1, _Signatures_ will have additional fields for cycles, instructions, and floating point operation counters by precision and width.
|
|
|
|
- `DBMaxConnections`: restricts the maximum number of simultaneous command connections.
|
|
|
|
|
|
|
|
If any of these settings is set to 0, the affected table will hold fewer details, but its records will take less storage space.
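
To check how an existing database was created, one option is to look for the optional columns directly. The sketch below assumes a MySQL/MariaDB server and that the EAR database is the currently selected one; it looks for both the pre-7.0 and the 7.0 names of the first floating point counter.

```sql
-- Sketch: list the optional detail columns of Signatures, if present.
-- An empty result suggests the table was created with DBReportSigDetail=0.
SELECT COLUMN_NAME, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Signatures'
  AND COLUMN_NAME IN ('cycles', 'instructions', 'FLOPS1', 'sp64_ops');
```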
|
| ... | ... | @@ -78,14 +78,108 @@ Additionally, if EAR was compiled in a system with GPUs (or with the GPU flag ma |
|
|
|
|
|
|
|
This section covers how to manually update the EAR Database tables if you update the EAR version and you want to maintain your current database data.
|
|
|
|
|
|
|
|
### From EAR 6.0 to 7.0
|
|
|
|
|
|
|
|
EAR 7.0 renames several database columns to make their meaning more explicit. In the job and application tables, the previous `local_id` field is now named `app_id`, while the field that previously held the application name in _Jobs_ (also called `app_id`) is now named `app_name`. Signature tables also use more descriptive metric names.
|
|
|
|
|
|
|
|
For the job table:
|
|
|
|
|
|
|
|
```sql
-- Rename the old app_id (application name) first, so that local_id can take over the app_id name.
ALTER TABLE Jobs CHANGE COLUMN app_id app_name VARCHAR(128);
ALTER TABLE Jobs CHANGE COLUMN local_id app_id INT UNSIGNED NOT NULL;
ALTER TABLE Jobs CHANGE COLUMN user_id user_name VARCHAR(128);
ALTER TABLE Jobs CHANGE COLUMN start_mpi_time earl_start_time INT NOT NULL;
ALTER TABLE Jobs CHANGE COLUMN end_mpi_time earl_end_time INT NOT NULL;
ALTER TABLE Jobs CHANGE COLUMN def_f def_cpu_freq INT UNSIGNED;
ALTER TABLE Jobs CHANGE COLUMN def_gpu_f def_gpu_freq INT UNSIGNED;
-- The primary key now includes app_id.
ALTER TABLE Jobs DROP PRIMARY KEY, ADD PRIMARY KEY (job_id, step_id, app_id);
```
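
After running the statements for the job table, the new primary key can be verified. This is only a sketch, assuming a MySQL/MariaDB server with the EAR database currently selected.

```sql
-- Sketch: list the columns that make up the primary key of Jobs, in key order.
-- After the migration this should list job_id, step_id and app_id.
SELECT COLUMN_NAME, SEQ_IN_INDEX
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Jobs'
  AND INDEX_NAME = 'PRIMARY'
ORDER BY SEQ_IN_INDEX;
```

The same check can be used for the other migrated tables by changing the `TABLE_NAME` value.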
|
|
|
|
|
|
|
|
If learning applications are enabled, apply the same changes to `Learning_jobs`. Note that `Learning_jobs` uses wider text fields (`VARCHAR(256)`) in newly created databases:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Learning_jobs CHANGE COLUMN app_id app_name VARCHAR(256);
ALTER TABLE Learning_jobs CHANGE COLUMN local_id app_id INT UNSIGNED NOT NULL;
ALTER TABLE Learning_jobs CHANGE COLUMN user_id user_name VARCHAR(256);
ALTER TABLE Learning_jobs CHANGE COLUMN start_mpi_time earl_start_time INT NOT NULL;
ALTER TABLE Learning_jobs CHANGE COLUMN end_mpi_time earl_end_time INT NOT NULL;
ALTER TABLE Learning_jobs CHANGE COLUMN def_f def_cpu_freq INT UNSIGNED;
ALTER TABLE Learning_jobs CHANGE COLUMN def_gpu_f def_gpu_freq INT UNSIGNED;
ALTER TABLE Learning_jobs DROP PRIMARY KEY, ADD PRIMARY KEY (job_id, step_id, app_id);
```
|
|
|
|
|
|
|
|
For the application table:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Applications CHANGE COLUMN local_id app_id INT UNSIGNED NOT NULL;
ALTER TABLE Applications CHANGE COLUMN node_id node_name VARCHAR(64);
ALTER TABLE Applications CHANGE COLUMN signature_id earl_signature_id BIGINT UNSIGNED;
ALTER TABLE Applications CHANGE COLUMN power_signature_id eard_signature_id INT UNSIGNED;
ALTER TABLE Applications DROP PRIMARY KEY, ADD PRIMARY KEY (job_id, step_id, app_id, node_name);
```
|
|
|
|
|
|
|
|
If learning applications are enabled, apply the same changes to `Learning_applications`:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Learning_applications CHANGE COLUMN local_id app_id INT UNSIGNED NOT NULL;
ALTER TABLE Learning_applications CHANGE COLUMN node_id node_name VARCHAR(64);
ALTER TABLE Learning_applications CHANGE COLUMN signature_id earl_signature_id BIGINT UNSIGNED;
ALTER TABLE Learning_applications CHANGE COLUMN power_signature_id eard_signature_id INT UNSIGNED;
ALTER TABLE Learning_applications DROP PRIMARY KEY, ADD PRIMARY KEY (job_id, step_id, app_id, node_name);
```
|
|
|
|
|
|
|
|
For the loops table:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Loops CHANGE COLUMN `event` entry INT UNSIGNED NOT NULL;
ALTER TABLE Loops CHANGE COLUMN local_id app_id INT UNSIGNED NOT NULL;
ALTER TABLE Loops CHANGE COLUMN node_id node_name VARCHAR(64);
ALTER TABLE Loops CHANGE COLUMN total_iterations `timestamp` INT UNSIGNED;
```
|
|
|
|
|
|
|
|
For the signatures table:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Signatures CHANGE COLUMN DC_power node_power FLOAT;
ALTER TABLE Signatures CHANGE COLUMN DRAM_power dram_power FLOAT;
ALTER TABLE Signatures CHANGE COLUMN PCK_power pck_power FLOAT;
ALTER TABLE Signatures DROP COLUMN EDP;
ALTER TABLE Signatures CHANGE COLUMN GBS dram_bandwidth FLOAT;
ALTER TABLE Signatures CHANGE COLUMN IO_MBS io_bandwidth FLOAT;
ALTER TABLE Signatures CHANGE COLUMN Gflops cpu_gflops FLOAT;
ALTER TABLE Signatures CHANGE COLUMN `time` elapsed_time FLOAT;
ALTER TABLE Signatures CHANGE COLUMN avg_f avg_cpu_freq INT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN avg_imc_f avg_imc_freq INT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN def_f def_cpu_freq INT UNSIGNED;
```
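
Note that `EDP` is dropped rather than renamed. Going by the definition used in the field reference for the old column (time x time x DC_power), the value can still be derived at query time from the renamed columns. A minimal sketch, assuming the migration above has already been applied:

```sql
-- Sketch: recompute the Energy Delay Product from the 7.0 column names.
-- elapsed_time replaces `time`, node_power replaces DC_power.
SELECT id,
       elapsed_time * elapsed_time * node_power AS edp
FROM Signatures;
```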
|
|
|
|
|
|
|
|
If the database was created with extended application signatures, also rename the FLOPS columns:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Signatures CHANGE COLUMN FLOPS1 sp64_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS2 sp128_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS3 sp256_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS4 sp512_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS5 dp64_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS6 dp128_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS7 dp256_ops BIGINT UNSIGNED;
ALTER TABLE Signatures CHANGE COLUMN FLOPS8 dp512_ops BIGINT UNSIGNED;
```
|
|
|
|
|
|
|
|
Apply the same `Signatures` changes to `Learning_signatures` if learning applications are enabled.
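
Spelled out, that would look like the sketch below. It assumes `Learning_signatures` has exactly the same columns and types as `Signatures`, which should be checked against your database before running it.

```sql
-- Sketch: the Signatures renames applied to Learning_signatures.
-- Assumes both tables share the same column layout.
ALTER TABLE Learning_signatures CHANGE COLUMN DC_power node_power FLOAT;
ALTER TABLE Learning_signatures CHANGE COLUMN DRAM_power dram_power FLOAT;
ALTER TABLE Learning_signatures CHANGE COLUMN PCK_power pck_power FLOAT;
ALTER TABLE Learning_signatures DROP COLUMN EDP;
ALTER TABLE Learning_signatures CHANGE COLUMN GBS dram_bandwidth FLOAT;
ALTER TABLE Learning_signatures CHANGE COLUMN IO_MBS io_bandwidth FLOAT;
ALTER TABLE Learning_signatures CHANGE COLUMN Gflops cpu_gflops FLOAT;
ALTER TABLE Learning_signatures CHANGE COLUMN `time` elapsed_time FLOAT;
ALTER TABLE Learning_signatures CHANGE COLUMN avg_f avg_cpu_freq INT UNSIGNED;
ALTER TABLE Learning_signatures CHANGE COLUMN avg_imc_f avg_imc_freq INT UNSIGNED;
ALTER TABLE Learning_signatures CHANGE COLUMN def_f def_cpu_freq INT UNSIGNED;
```

If the database was created with extended application signatures, the `FLOPS1` to `FLOPS8` renames follow the same pattern with `Learning_signatures` as the table name.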
|
|
|
|
|
|
|
|
### From EAR 5.0 to 6.0
|
|
|
|
|
|
|
|
Since 6.0 EAR reports the CPU utilization of applications in both loop and application signatures:
|
|
|
|
Since 6.0, EAR reports the CPU utilization of applications in both loop and application signatures as well as GPU Gflops:
|
|
|
|
|
|
|
|
```sql
ALTER TABLE Signatures ADD COLUMN cpu_util INT UNSIGNED AFTER def_f;
```
|
|
|
|
|
|
|
|
```sql
ALTER TABLE GPU_signatures ADD COLUMN GPU_gflops FLOAT;
```
|
|
|
|
|
|
|
|
### From EAR 4.3 to 5.0
|
|
|
|
|
|
|
|
For better internal consistency, Jobs' id field was renamed:
|
| ... | ... | @@ -200,62 +294,65 @@ EAR's database contains several tables, as described [here](#tables). Each table |
|
|
|
|
|
|
|
### Jobs
|
|
|
|
|
|
|
|
* id: Job id given by the scheduler (for example SLURM_JOBID).
|
|
|
|
* job_id: Job id given by the scheduler (for example SLURM_JOBID).
|
|
|
|
* step_id: step id given by the scheduler.
|
|
|
|
* user_id: the Linux username that executed the job.
|
|
|
|
* app_id: the application/job name as given by the scheduler (not necessarily the executable’s name)
|
|
|
|
* app_id: local application id generated by EAR to identify applications or workflow steps within the same scheduler job and step.
|
|
|
|
* user_name: the Linux username that executed the job.
|
|
|
|
* app_name: the application/job name as given by the scheduler (not necessarily the executable's name).
|
|
|
|
* start_time: timestamp of the job’s\[.step\] start
|
|
|
|
* end_time: timestamp of the job’s\[.step\] end
|
|
|
|
* start_mpi_time: timestamp of the beginning of application region managed by the EARL. Named MPI for historical reasons. For MPI applications timestamp of the MPI_Init execution.
|
|
|
|
* end_mpi_time: timestamp of the end of application region managed by the EARL. Named MPI for historical reasons. For MPI applications timestamp of the MPI_Finalize execution.
|
|
|
|
* earl_start_time: timestamp of the beginning of application region managed by the EARL. For MPI applications, timestamp of the MPI_Init execution.
|
|
|
|
* earl_end_time: timestamp of the end of application region managed by the EARL. For MPI applications, timestamp of the MPI_Finalize execution.
|
|
|
|
* policy: EAR policy name in action for the job. Can be “No Policy” if the job runs without EAR.
|
|
|
|
* threshold: threshold used by the policy to configure its behavior. For example, the maximum performance penalty in min_energy.
|
|
|
|
* job_type:
|
|
|
|
* def_f: default CPU frequency requested by the user/job manager.
|
|
|
|
* user_acc: the account the user_id belongs to.
|
|
|
|
* user_group: the Linux group name the user_id belongs to.
|
|
|
|
* num_procs: number of processes used by the application.
|
|
|
|
* def_cpu_freq: default CPU frequency requested by the user/job manager.
|
|
|
|
* def_gpu_freq: default GPU frequency requested by the user/job manager.
|
|
|
|
* user_acc: the account the user_name belongs to.
|
|
|
|
* user_group: the Linux group name the user_name belongs to.
|
|
|
|
* e_tag: energy tag. The user can specify an energy tag to apply pre-defined CPU frequency settings.
|
|
|
|
|
|
|
|
### Applications
|
|
|
|
|
|
|
|
* job_id: job id given by the scheduler. Used as a foreign key for Jobs.
|
|
|
|
* step_id: step id given by the scheduler. Used as a foreign key for Jobs.
|
|
|
|
* node_id: name of the node on which the application ran. Node names are trimmed at the first “.”, e.g., node1.at.cluster becomes node1.
|
|
|
|
* signature_id: the id (index) of the computed signature for the job on this node. If the job runs without EAR library the field will be NULL.
|
|
|
|
* power_signature_id: the id (index) of the power signature for the job on this node.
|
|
|
|
* app_id: local application id. Used as a foreign key for Jobs.
|
|
|
|
* node_name: name of the node on which the application ran. Node names are trimmed at the first “.”, e.g., node1.at.cluster becomes node1.
|
|
|
|
* earl_signature_id: the id (index) of the computed EARL signature for the job on this node. If the job runs without EAR library the field will be NULL.
|
|
|
|
* eard_signature_id: the id (index) of the power signature reported by EARD for the job on this node.
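
Since `earl_signature_id` is NULL when a job runs without the EAR library, that column can be used to find such runs. A minimal sketch, assuming the EAR 7.0 column names listed above:

```sql
-- Sketch: nodes of applications that ran without EARL.
-- These records still carry an EARD power signature (eard_signature_id).
SELECT job_id, step_id, app_id, node_name, eard_signature_id
FROM Applications
WHERE earl_signature_id IS NULL;
```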
|
|
|
|
|
|
|
|
### Signatures
|
|
|
|
|
|
|
|
_All the metrics in this table refer to the period of time over which the Signature is computed, typically 10 seconds. Signatures are only reported when the application uses the EAR library._
|
|
|
|
|
|
|
|
* id: unique id generated by the database engine to be used in JOIN queries.
|
|
|
|
* DC_power: average DC node power (in Watts).
|
|
|
|
* DRAM_power: average DRAM power, including the 2 sockets (in Watts).
|
|
|
|
* PCK_power: Average CPU power, including the 2 sockets (in Watts).
|
|
|
|
* EDP: Energy Delay Product computed as (time x time x DC_power).
|
|
|
|
* GBS: Main memory bandwidth (GB/sec).
|
|
|
|
* IO_MBS: I/O read and write rate (MB/s).
|
|
|
|
* node_power: average DC node power (in Watts).
|
|
|
|
* dram_power: average DRAM power, including the 2 sockets (in Watts).
|
|
|
|
* pck_power: Average CPU power, including the 2 sockets (in Watts).
|
|
|
|
* dram_bandwidth: Main memory bandwidth (GB/sec).
|
|
|
|
* io_bandwidth: I/O read and write rate (MB/s).
|
|
|
|
* TPI: Main memory transactions per instruction.
|
|
|
|
* CPI: Cycles per instruction.
|
|
|
|
* Gflops: giga floating point operations per second (GFLOPS) generated by the application processes in the node.
|
|
|
|
* time: total execution time (in seconds)
|
|
|
|
* cpu_gflops: giga floating point operations per second (GFLOPS) generated by the application processes in the node.
|
|
|
|
* elapsed_time: total execution time (in seconds)
|
|
|
|
* perc_MPI: average percentage of MPI time vs computational time in the node. Includes all the application processes in the node.
|
|
|
|
* L1_misses: L1 cache misses counter.
|
|
|
|
* L2_misses: L2 cache misses counter.
|
|
|
|
* L3_misses: L3 cache misses counter.
|
|
|
|
* FLOPS1: Floating point operations Single precision 64 bits consumed by application processes in the node.
|
|
|
|
* FLOPS2: Floating point operations Single precision 128 bits consumed by application processes in the node.
|
|
|
|
* FLOPS3: Floating point operations Single precision 256 bits consumed by application processes in the node.
|
|
|
|
* FLOPS4: Floating point operations Single precision 512 bits consumed by application processes in the node.
|
|
|
|
* FLOPS5: Floating point operations Double precision 64 bits consumed by application processes in the node.
|
|
|
|
* FLOPS6: Floating point operations Double precision 128 bits consumed by application processes in the node.
|
|
|
|
* FLOPS7: Floating point operations Double precision 256 bits consumed by application processes in the node.
|
|
|
|
* FLOPS8: Floating point operations Double precision 512 bits consumed by application processes in the node.
|
|
|
|
* sp64_ops: Floating point operations Single precision 64 bits consumed by application processes in the node.
|
|
|
|
* sp128_ops: Floating point operations Single precision 128 bits consumed by application processes in the node.
|
|
|
|
* sp256_ops: Floating point operations Single precision 256 bits consumed by application processes in the node.
|
|
|
|
* sp512_ops: Floating point operations Single precision 512 bits consumed by application processes in the node.
|
|
|
|
* dp64_ops: Floating point operations Double precision 64 bits consumed by application processes in the node.
|
|
|
|
* dp128_ops: Floating point operations Double precision 128 bits consumed by application processes in the node.
|
|
|
|
* dp256_ops: Floating point operations Double precision 256 bits consumed by application processes in the node.
|
|
|
|
* dp512_ops: Floating point operations Double precision 512 bits consumed by application processes in the node.
|
|
|
|
* instructions: total instructions executed by the application processes in the node.
|
|
|
|
* cycles: total cycles consumed by the application processes in the node.
|
|
|
|
* avg_f: average CPU frequency (includes all the cores used by the application on the node) in KHz.
|
|
|
|
* avg_imc_f: average memory frequency (includes the two sockets) in KHz.
|
|
|
|
* def_f: default CPU frequency used at the beginning of the application in KHz.
|
|
|
|
* avg_cpu_freq: average CPU frequency (includes all the cores used by the application on the node) in KHz.
|
|
|
|
* avg_imc_freq: average memory frequency (includes the two sockets) in KHz.
|
|
|
|
* def_cpu_freq: default CPU frequency used at the beginning of the application in KHz.
|
|
|
|
* cpu_util: average CPU utilization for the reported period.
|
|
|
|
* min_GPU_sig_id: start of the range containing the GPU_signature’s ids, used for JOIN queries. If an application doesn’t have GPUs it will be NULL.
|
|
|
|
* max_GPU_sig_id: end of the range containing the GPU_signature’s ids, used for JOIN queries. If an application doesn’t have GPUs it will be NULL.
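
The `min_GPU_sig_id` and `max_GPU_sig_id` fields can be used as a range in a JOIN with _GPU_signatures_. The sketch below assumes that _GPU_signatures_ has an `id` column generated by the database engine (as _Signatures_ does) and that the range is inclusive on both ends; neither detail is stated on this page.

```sql
-- Sketch: one row per GPU for every signature that has GPU data.
-- g.id and the inclusive range are assumptions, see the note above.
SELECT s.id AS signature_id,
       g.id AS gpu_signature_id,
       g.GPU_util,
       g.GPU_mem_util
FROM Signatures AS s
JOIN GPU_signatures AS g
  ON g.id BETWEEN s.min_GPU_sig_id AND s.max_GPU_sig_id
WHERE s.min_GPU_sig_id IS NOT NULL;
```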
|
|
|
|
|
| ... | ... | @@ -286,6 +383,7 @@ _Power signatures are measured and reported by the EARD and reported for all the |
|
|
|
* GPU_mem_freq: average GPU memory frequency for a single GPU (in KHz)
|
|
|
|
* GPU_util: average GPU utilisation for the reported period for a single GPU (percentage).
|
|
|
|
* GPU_mem_util: average GPU memory utilisation for the reported period for a single GPU (percentage).
|
|
|
|
* GPU_gflops: giga floating point operations per second (GFLOPS) generated by the application processes on the GPU.
|
|
|
|
|
|
|
|
> If an application has more than 1 GPU there will be a signature for each of them.
|
|
|
|
|
| ... | ... | @@ -293,16 +391,17 @@ _Power signatures are measured and reported by the EARD and reported for all the |
|
|
|
|
|
|
|
_Loops are only reported when the EAR library is used._
|
|
|
|
|
|
|
|
* event: loop type identifier. It is used internally by the EAR library, together with size and level.
|
|
|
|
* entry: loop type identifier. It is used internally by the EAR library, together with size and level.
|
|
|
|
* size: loop’s size as computed by DynAIS.
|
|
|
|
* level: loop’s level of depth (indicative of loops inside of loops)
|
|
|
|
* job_id: job id given by the job manager. Used as a foreign key for Jobs.
|
|
|
|
* step_id: step id given by the job manager. Used as a foreign key for Jobs.
|
|
|
|
* node_id: name of the node on which the application ran. Node names are trimmed at the first “.”, e.g., node1.at.cluster becomes node1.
|
|
|
|
* total_iterations: timestamp at which the loop signature has been reported. It is named total_iterations for historical reasons.
|
|
|
|
* app_id: local application id. Used as a foreign key for Jobs.
|
|
|
|
* node_name: name of the node on which the application ran. Node names are trimmed at the first “.”, e.g., node1.at.cluster becomes node1.
|
|
|
|
* timestamp: timestamp at which the loop signature has been reported.
|
|
|
|
* signature_id: the id of the computed signature for the job on this node.
|
|
|
|
|
|
|
|
> 1. the combination event-size-level forms the Primary Key for the Loops table.
|
|
|
|
> 1. the combination entry-size-level identifies the loop for internal EARL usage.
|
|
|
|
> 2. Loops will always have a signature because they are only reported when the EAR library is used.
|
|
|
|
> 3. When a loop is inserted, the corresponding Job is probably not in the database yet, because Jobs are inserted only when an application finishes. JOIN queries with Jobs can only be done once an application has finished (only the current step id needs to finish, not the entire job).
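
The last note can be observed with a LEFT JOIN: loops whose job record has not been inserted yet have no match in _Jobs_. A minimal sketch, assuming the EAR 7.0 column names:

```sql
-- Sketch: loop records whose application has not finished yet,
-- i.e. there is no matching Jobs record so far.
SELECT l.job_id, l.step_id, l.app_id, l.node_name, l.`timestamp`
FROM Loops AS l
LEFT JOIN Jobs AS j
  ON j.job_id = l.job_id
 AND j.step_id = l.step_id
 AND j.app_id = l.app_id
WHERE j.job_id IS NULL;
```

Replacing the `LEFT JOIN` with a plain `JOIN` and dropping the `WHERE` clause gives the opposite view: only loops of applications that have already finished.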
|
|
|
|
|
| ... | ... | |