Accounting And Profiling
Slurm can collect accounting and profiling information about jobs and job steps. Below are instructions on how to set up accounting for the different plugins, as well as the official documentation about accounting.
Slurmdbd
slurmdbd is used for job accounting.
Elasticsearch plugin
The Slurm Elasticsearch plugin stores accounting data for finished jobs.
It is enabled automatically by relating the slurmctld charm
with the Elasticsearch charm:
$ juju relate slurmctld elasticsearch
slurmctld creates a new index named after the cluster; the document type
is jobcomp.
Data saved
For example, to retrieve all documents saved in the osd-cluster index from the
Elasticsearch server at 10.220.130.6:
$ curl -XGET 'http://10.220.130.6:9200/osd-cluster/_search?pretty=true&q=*:*'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "osd-cluster",
        "_type" : "jobcomp",
        "_id" : "bnhJ3n0BaChVi1VR3viV",
        "_score" : 1.0,
        "_source" : {
          "jobid" : 24,
          "username" : "john",
          "user_id" : 1200,
          "groupname" : "john",
          "group_id" : 1200,
          "@start" : "2021-12-21T18:37:02",
          "@end" : "2021-12-21T18:37:02",
          "elapsed" : 0,
          "partition" : "osd-slurmd",
          "alloc_node" : "juju-f48c73-285",
          "nodes" : "juju-f48c73-286",
          "total_cpus" : 16,
          "total_nodes" : 1,
          "derived_ec" : "0:0",
          "exit_code" : "0:0",
          "state" : "COMPLETED",
          "cpu_hours" : 0.0,
          "pack_job_id" : 0,
          "pack_job_offset" : 0,
          "het_job_id" : 0,
          "het_job_offset" : 0,
          "@submit" : "2021-12-21T18:37:02",
          "@eligible" : "2021-12-21T18:37:02",
          "queue_wait" : 0,
          "work_dir" : "/home/john/project/foo",
          "cluster" : "osd-cluster",
          "qos" : "normal",
          "ntasks" : 0,
          "ntasks_per_node" : 0,
          "ntasks_per_tres" : 0,
          "cpus_per_task" : 1,
          "job_name" : "hostname",
          "tres_req" : "cpu=1,mem=15921M,node=1,billing=1",
          "tres_alloc" : "cpu=16,node=1,billing=16",
          "account" : "john",
          "parent_accounts" : "/users/user"
        }
      },
      {
        "_index" : "osd-cluster",
        "_type" : "jobcomp",
        "_id" : "b3hJ3n0BaChVi1VR3vi0",
        "_score" : 1.0,
        "_source" : {
          "jobid" : 25,
          "username" : "root",
          "user_id" : 0,
          "groupname" : "root",
          "group_id" : 0,
          "@start" : "2021-12-21T18:37:25",
          "@end" : "2021-12-21T18:37:25",
          "elapsed" : 0,
          "partition" : "osd-slurmd",
          "alloc_node" : "juju-f48c73-285",
          "nodes" : "juju-f48c73-286",
          "total_cpus" : 16,
          "total_nodes" : 1,
          "derived_ec" : "0:0",
          "exit_code" : "0:0",
          "state" : "COMPLETED",
          "cpu_hours" : 0.0,
          "pack_job_id" : 0,
          "pack_job_offset" : 0,
          "het_job_id" : 0,
          "het_job_offset" : 0,
          "@submit" : "2021-12-21T18:37:25",
          "@eligible" : "2021-12-21T18:37:25",
          "queue_wait" : 0,
          "work_dir" : "/root",
          "cluster" : "osd-cluster",
          "qos" : "normal",
          "ntasks" : 0,
          "ntasks_per_node" : 0,
          "ntasks_per_tres" : 0,
          "cpus_per_task" : 1,
          "job_name" : "hostname",
          "tres_req" : "cpu=1,mem=15921M,node=1,billing=1",
          "tres_alloc" : "cpu=16,node=1,billing=16",
          "account" : "root",
          "parent_accounts" : "/root/root"
        }
      }
    ]
  }
}
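The search response nests each job record under hits.hits[]._source. As a minimal sketch of how such a response could be reduced to a per-job summary, the following Python snippet walks that structure. The function name summarize_jobs and the choice of summary fields are illustrative assumptions, and sample_response is a trimmed copy of the output shown above rather than a live query result.

```python
from typing import Any


def summarize_jobs(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one small summary dict per job document in a search response."""
    summaries = []
    # Each hit wraps the accounting record in its "_source" object.
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        summaries.append(
            {
                "jobid": source.get("jobid"),
                "username": source.get("username"),
                "state": source.get("state"),
                "elapsed": source.get("elapsed"),
                "nodes": source.get("nodes"),
            }
        )
    return summaries


# Trimmed copy of the example response; only a few _source keys are kept.
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {
                "_index": "osd-cluster",
                "_source": {
                    "jobid": 24,
                    "username": "john",
                    "state": "COMPLETED",
                    "elapsed": 0,
                    "nodes": "juju-f48c73-286",
                },
            },
            {
                "_index": "osd-cluster",
                "_source": {
                    "jobid": 25,
                    "username": "root",
                    "state": "COMPLETED",
                    "elapsed": 0,
                    "nodes": "juju-f48c73-286",
                },
            },
        ],
    }
}

for job in summarize_jobs(sample_response):
    print(job["jobid"], job["username"], job["state"])
```

In practice the dictionary would come from decoding the JSON body returned by the curl request, for example with json.loads.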
InfluxDB profiling plugin
Slurm provides a profiling plugin that collects metrics and sends them to
InfluxDB. OSD encapsulates
the configuration of this plugin in a Juju relation between the slurmctld and
influxdb charms.
A basic setup involves the following steps:
1. Deploy the InfluxDB charm.
2. Relate slurmctld and influxdb.
3. [optional] Configure the accounting frequency.
The Juju commands to accomplish these steps are:
$ juju deploy influxdb
$ juju relate slurmctld influxdb
$ juju config slurmctld acct-gather-frequency="task=30"
In this scenario, slurmctld will set up everything needed to collect and
save the metrics. This includes creating a user and a database in InfluxDB.
The username is slurm and the password is generated at random, while the name
of the database is the name of the cluster, as set in slurmctld's
configuration option cluster-name.
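The acct-gather-frequency value set above is a key=seconds pair. The sketch below shows one way to check such a value before passing it to juju config. It assumes the option accepts the comma-separated key=seconds form used by Slurm's JobAcctGatherFrequency setting; only task=30 appears in this document, and the helper name parse_gather_frequency is hypothetical.

```python
def parse_gather_frequency(value: str) -> dict[str, int]:
    """Parse a string such as "task=30,energy=60" into {"task": 30, "energy": 60}."""
    intervals: dict[str, int] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, seconds = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=seconds, got {item!r}")
        # int() raises ValueError if the interval is not a whole number.
        intervals[key.strip()] = int(seconds)
    return intervals


print(parse_gather_frequency("task=30"))
```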
Data saved
Slurm collects profiling metrics at a frequency specified in the slurmctld
configuration option acct-gather-frequency. The following field keys are
saved for the tasks:
- CPUFrequency: CPU frequency at time of sample. Field type: float.
- CPUTime: Seconds of CPU time used during the sample. Field type: float.
- CPUUtilization: CPU utilization during the interval. Field type: float.
- RSS: Value of RSS at time of sample. Field type: float.
- VMSize: Value of VM size at time of sample. Field type: float.
- Pages: Pages used in sample. Field type: float.
- ReadMB: Number of megabytes read from local disk. Field type: float.
- WriteMB: Number of megabytes written to local disk. Field type: float.
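Since every task field listed above is a float, client code that consumes these samples may want to check that all eight keys are present and coerce them to a uniform type. The following is a minimal sketch of that idea. The helper name normalize_sample is hypothetical, the numbers in example are made up for illustration, and nothing here reflects how the data is actually laid out inside InfluxDB.

```python
# The eight task field keys listed in this section.
TASK_FIELD_KEYS = (
    "CPUFrequency",
    "CPUTime",
    "CPUUtilization",
    "RSS",
    "VMSize",
    "Pages",
    "ReadMB",
    "WriteMB",
)


def normalize_sample(raw: dict) -> dict[str, float]:
    """Check that all task field keys are present and coerce values to float."""
    missing = [key for key in TASK_FIELD_KEYS if key not in raw]
    if missing:
        raise ValueError(f"sample is missing field keys: {missing}")
    return {key: float(raw[key]) for key in TASK_FIELD_KEYS}


# Made-up values, used only to exercise the helper.
example = {
    "CPUFrequency": 2400,
    "CPUTime": "0.5",
    "CPUUtilization": 1.6,
    "RSS": 1024,
    "VMSize": 4096,
    "Pages": 0,
    "ReadMB": 0.0,
    "WriteMB": 0.25,
}

print(normalize_sample(example))
```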
Accessing the data
The slurmctld charm provides a convenient Juju Action to export the
InfluxDB parameters needed to set up a Grafana Data Source:
$ juju run-action slurmctld/leader influxdb-info --wait
unit-slurmctld-13:
  UnitId: slurmctld/13
  id: "573"
  results:
    influxdb: '{''ingress'': ''10.220.130.30'', ''port'': ''8086'', ''user'': ''slurm'',
      ''password'': ''LeCZSef2IzyOp3GAnYNC'', ''database'': ''osd-cluster'', ''retention_policy'':
      ''autogen''}'
  status: completed
  timing:
    completed: 2021-07-20 13:00:35 +0000 UTC
    enqueued: 2021-07-20 13:00:31 +0000 UTC
    started: 2021-07-20 13:00:34 +0000 UTC
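Once the YAML output is decoded, the influxdb result is a single string that looks like a Python dictionary literal. As a rough sketch, it can be turned into connection settings with ast.literal_eval, which only evaluates literals and does not execute code. The function names below are hypothetical, and the http:// scheme in the URL is an assumption, since the action output does not state one.

```python
import ast


def parse_influxdb_info(raw: str) -> dict[str, str]:
    """Turn the action's dict-like result string into a real dictionary."""
    info = ast.literal_eval(raw)
    if not isinstance(info, dict):
        raise ValueError("expected a dictionary literal")
    return info


def influxdb_url(info: dict[str, str]) -> str:
    """Build a base URL; plain HTTP is assumed here, not confirmed by the output."""
    return f"http://{info['ingress']}:{info['port']}"


# The result string from the example output above, after YAML unquoting.
raw_result = (
    "{'ingress': '10.220.130.30', 'port': '8086', 'user': 'slurm', "
    "'password': 'LeCZSef2IzyOp3GAnYNC', 'database': 'osd-cluster', "
    "'retention_policy': 'autogen'}"
)

info = parse_influxdb_info(raw_result)
print(influxdb_url(info), info["database"], info["user"])
```

The URL, database, user and password are the values a Grafana InfluxDB data source would typically ask for.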