Monitoring

We provide a bundle overlay to simplify deploying Prometheus, Prometheus node_exporter and slurm-exporter to monitor the cluster and each individual node.

Prometheus node_exporter

The subordinate charm prometheus-node-exporter can be used to export machine metrics to a Prometheus instance. To monitor all nodes in the cluster, first deploy the prometheus-node-exporter application and then relate it to the applications whose nodes should be monitored:

$ juju deploy prometheus-node-exporter
$ juju relate prometheus-node-exporter slurmd
$ juju relate prometheus-node-exporter slurmctld
$ juju relate prometheus-node-exporter slurmdbd
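
The relate commands above all follow one pattern, so they can be scripted. The following is a minimal sketch and not part of the charm tooling: the helper names `run` and `relate_node_exporter` and the `DRY_RUN` variable are hypothetical, introduced only for illustration. With `DRY_RUN=1` the commands are printed rather than executed, which lets you review them first.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: relate prometheus-node-exporter to each application given.
relate_node_exporter() {
    for app in "$@"; do
        run juju relate prometheus-node-exporter "$app"
    done
}

# Example (commented out so that sourcing this snippet changes nothing):
# DRY_RUN=1 relate_node_exporter slurmd slurmctld slurmdbd
```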

By default, this charm exposes all metrics at the /metrics endpoint on port 9100.
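
To confirm that a unit is serving metrics, you can request the endpoint directly. This is a minimal sketch, assuming `curl` is installed and the unit's address is reachable from where you run it. `metrics_url` and `fetch_metrics` are hypothetical helper names, and 10.0.0.5 is a placeholder address.

```shell
# Hypothetical helper: build the metrics URL for a host and port (default 9100).
metrics_url() {
    printf 'http://%s:%s/metrics\n' "$1" "${2:-9100}"
}

# Hypothetical helper: fetch the metrics page.
# Fails on HTTP errors or if the request takes longer than 5 seconds.
fetch_metrics() {
    curl --silent --fail --max-time 5 "$(metrics_url "$@")"
}

# Example with a placeholder address (commented out because it needs a live unit):
# fetch_metrics 10.0.0.5 | head
```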

The charm prometheus-node-exporter can be related to the prometheus2 charm so that Prometheus scrapes all units automatically. To enable this, deploy Prometheus and relate it to prometheus-node-exporter:

$ juju deploy prometheus2
$ juju relate prometheus-node-exporter:prometheus prometheus2:scrape

Please refer to these charms' documentation for configuration details.

Prometheus Slurm exporter

The subordinate charm slurm-exporter exports metrics about Slurm, such as the state of nodes, jobs, partitions, accounts, scheduler, CPUs, and GPUs. To monitor the cluster, deploy the application and relate it to slurmrestd:

$ juju deploy slurm-exporter
$ juju relate slurm-exporter slurmrestd

Note

We recommend deploying slurm-exporter on the slurmrestd node, although it can also be deployed on other nodes.

By default, this charm exposes all metrics at the /metrics endpoint on port 9120.
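
The /metrics page uses the Prometheus text exposition format, in which lines starting with `#` carry HELP and TYPE comments and the remaining non-empty lines are samples. As a quick sanity check, the sketch below counts the samples in such a page. `count_samples` is a hypothetical helper name, and the address in the example is a placeholder.

```shell
# Hypothetical helper: count sample lines in Prometheus text format read from stdin.
# Comment lines (starting with '#') and blank lines are excluded.
# '|| true' keeps the exit status zero when the count is 0.
count_samples() {
    grep -c -v -e '^#' -e '^[[:space:]]*$' || true
}

# Example with a placeholder address (commented out because it needs a live unit):
# curl --silent http://10.0.0.5:9120/metrics | count_samples
```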

The charm slurm-exporter can be related to the prometheus2 charm so that Prometheus scrapes its metrics automatically. To enable this, deploy Prometheus (skip the deploy step if a prometheus2 application already exists in the model) and relate it to slurm-exporter:

$ juju deploy prometheus2
$ juju relate slurm-exporter:prometheus prometheus2:scrape
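
Once the relation is established, the exporter should show up as a scrape target in Prometheus. The sketch below counts healthy targets in the JSON returned by the Prometheus HTTP API endpoint /api/v1/targets. It assumes the response is compact JSON (no spaces around the colon) and that Prometheus listens on port 9090, which may differ in your deployment. `count_up_targets` is a hypothetical helper name and the address is a placeholder.

```shell
# Hypothetical helper: count targets reported as healthy in a
# /api/v1/targets response read from stdin (assumes compact JSON).
count_up_targets() {
    grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Example with a placeholder address (commented out because it needs a live server):
# curl --silent http://10.0.0.6:9090/api/v1/targets | count_up_targets
```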

Please refer to these charms' documentation for configuration details.

You can use the Grafana Dashboard 4323 to visualize the metrics exported via slurm-exporter.