Accounting And Profiling
Contents
Accounting And Profiling#
Slurm can collect accounting and profiling information about jobs and job steps. Please find below instructions on how to setup accounting for the different plugins, as well as the official documentation about accounting.
Slurmdbd#
slurmdbd
is used for job accounting.
ElasticSearch plugin#
The Slurm Elastic Search Plugin stores accounting data for finished jobs.
This plugin can be automatically enabled by relating the slurmctld
charm
with the Elastic Search Charm:
$ juju relate slurmctld elasticsearch
The slurmctld
will create a new index with the name of the cluster and the
document type is named jobcomp
.
Data Saved#
An example to get all the documents saved in the osd-cluster
index from the
Elastic Search server at 10.220.130.6
is:
$ curl -XGET 'http://10.220.130.6:9200/osd-cluster/_search?pretty=true&q=*:*'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "osd-cluster",
"_type" : "jobcomp",
"_id" : "bnhJ3n0BaChVi1VR3viV",
"_score" : 1.0,
"_source" : {
"jobid" : 24,
"username" : "john",
"user_id" : 1200,
"groupname" : "john",
"group_id" : 1200,
"@start" : "2021-12-21T18:37:02",
"@end" : "2021-12-21T18:37:02",
"elapsed" : 0,
"partition" : "osd-slurmd",
"alloc_node" : "juju-f48c73-285",
"nodes" : "juju-f48c73-286",
"total_cpus" : 16,
"total_nodes" : 1,
"derived_ec" : "0:0",
"exit_code" : "0:0",
"state" : "COMPLETED",
"cpu_hours" : 0.0,
"pack_job_id" : 0,
"pack_job_offset" : 0,
"het_job_id" : 0,
"het_job_offset" : 0,
"@submit" : "2021-12-21T18:37:02",
"@eligible" : "2021-12-21T18:37:02",
"queue_wait" : 0,
"work_dir" : "/home/john/project/foo",
"cluster" : "osd-cluster",
"qos" : "normal",
"ntasks" : 0,
"ntasks_per_node" : 0,
"ntasks_per_tres" : 0,
"cpus_per_task" : 1,
"job_name" : "hostname",
"tres_req" : "cpu=1,mem=15921M,node=1,billing=1",
"tres_alloc" : "cpu=16,node=1,billing=16",
"account" : "john",
"parent_accounts" : "/users/user"
}
},
{
"_index" : "osd-cluster",
"_type" : "jobcomp",
"_id" : "b3hJ3n0BaChVi1VR3vi0",
"_score" : 1.0,
"_source" : {
"jobid" : 25,
"username" : "root",
"user_id" : 0,
"groupname" : "root",
"group_id" : 0,
"@start" : "2021-12-21T18:37:25",
"@end" : "2021-12-21T18:37:25",
"elapsed" : 0,
"partition" : "osd-slurmd",
"alloc_node" : "juju-f48c73-285",
"nodes" : "juju-f48c73-286",
"total_cpus" : 16,
"total_nodes" : 1,
"derived_ec" : "0:0",
"exit_code" : "0:0",
"state" : "COMPLETED",
"cpu_hours" : 0.0,
"pack_job_id" : 0,
"pack_job_offset" : 0,
"het_job_id" : 0,
"het_job_offset" : 0,
"@submit" : "2021-12-21T18:37:25",
"@eligible" : "2021-12-21T18:37:25",
"queue_wait" : 0,
"work_dir" : "/root",
"cluster" : "osd-cluster",
"qos" : "normal",
"ntasks" : 0,
"ntasks_per_node" : 0,
"ntasks_per_tres" : 0,
"cpus_per_task" : 1,
"job_name" : "hostname",
"tres_req" : "cpu=1,mem=15921M,node=1,billing=1",
"tres_alloc" : "cpu=16,node=1,billing=16",
"account" : "root",
"parent_accounts" : "/root/root"
}
}
]
}
}
InfluxDB profiling plugin#
Slurm provides a profiling gathering plugin to collect metrics and send them to
InfluxDB. OSD encapsulates
the configuration of this plugin in a Juju relation between slurmctld
and
influxdb
charms.
A basic setup involves the following steps:
Deploy InfluxDB charm.
Relate
slurmctld
andinfluxdb
.[optional] Configure the accounting frequency.
The Juju commands to accomplish these steps are:
$ juju deploy influxdb
$ juju relate slurmctld influxdb
$ juju config slurmctld acct-gather-frequency="task=30"
In this scenario, slurmctld
will setup everything needed to collect and
save the metrics. This includes creating an user and a database in InfluxDB.
The username is slurm
and the password is generated at random, while name
of the database is the name of the cluster, as set in slurmctld
's
configuration cluster-name
.
Data saved#
Slurm collects profiling metrics at a frequency specified in the slurmctld
configuration option acct-gather-frequency
. The following field keys are
saved for the tasks:
CPUFrequency
CPU Frequency at time of sample.
Field type:
float
.CPUTime
Seconds of CPU time used during the sample.
Field type:
float
.CPUUtilization
CPU Utilization during the interval.
Field type:
float
RSS
Value of RSS at time of sample.
Field type:
float
.VMSize
Value of VM Size at time of sample.
Field type:
float
.Pages
Pages used in sample.
Field type:
float
.ReadMB
Number of megabytes read from local disk.
Field type:
float
.WriteMB
Number of megabytes written to local disk.
Field type:
float
.
Accessing the data#
The slurmctld
charm provides a convenient Juju Action to export the
InfluxDB parameters to setup a Grafana Data Source:
$ juju run-action slurmctld/leader influxdb-info --wait
unit-slurmctld-13:
UnitId: slurmctld/13
id: "573"
results:
influxdb: '{''ingress'': ''10.220.130.30'', ''port'': ''8086'', ''user'': ''slurm'',
''password'': ''LeCZSef2IzyOp3GAnYNC'', ''database'': ''osd-cluster'', ''retention_policy'':
''autogen''}'
status: completed
timing:
completed: 2021-07-20 13:00:35 +0000 UTC
enqueued: 2021-07-20 13:00:31 +0000 UTC
started: 2021-07-20 13:00:34 +0000 UTC