NHC#

OSD installs LBNL Node Health Check (NHC) by default via the slurmd charm. NHC is a utility that helps prevent jobs from running on unhealthy nodes. To determine whether a node is healthy, NHC runs checks periodically. These health checks can be customized and tuned for each particular cluster, node, and/or hardware.

The base configuration file contains only basic checks to ensure the Slurm and Munge processes are active. You can easily extend the NHC configuration to match your setup.
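As an illustrative sketch (the match expressions and check names follow standard NHC syntax; the exact base checks shipped by the charm may differ), a configuration of this kind could look like:

```
# Run on every node ("*" matches all hostnames):
# verify the Munge and Slurm daemons are running.
* || check_ps_service -u munge -S munged
* || check_ps_service -S slurmd
```

Each line pairs a hostname match expression with a check; the check runs only on nodes whose short hostname matches the expression.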

When NHC identifies a node as unhealthy, it drains the node to prevent future jobs from running on it.
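When a node has been drained this way, you can inspect it, along with the drain reason NHC recorded, using standard Slurm tooling, for example:

```shell
# List down/drained nodes with the recorded reason, user, and timestamp
$ sinfo -R
```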

Acquire and Provide NHC#

Before the slurmd charm can run, NHC must be installed; the NHC .tar.gz release must be supplied as a charm resource:

Acquire NHC#

$ wget https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3.tar.gz

Provide NHC#

$ juju deploy slurmd --resource nhc=lbnl-nhc-1.4.3.tar.gz

Configuration#

Health Checks#

Note

OSD uses short hostnames (hostname -s) as node identifiers in Slurm. Because of this, the NHC configuration needs to use the short hostname as well. The base NHC configuration provided in the slurmd charm takes care of this by setting * || HOSTNAME="$HOSTNAME_S" at the top of the NHC configuration file. This value will not be overridden by any custom user-supplied NHC configuration.

This is especially important when running checks on specific nodes in the cluster. For example, to run the NVIDIA monitoring only on the gpu-* nodes:

gpu-* || NVIDIA_HEALTHMON=...

For example, suppose you also want to check that the InfiniBand interconnect runs at 100 Gb/s and that the /scratch partition is mounted read-write. The easiest way to do so is to create a custom-conf.nhc file with those checks:

* || check_fs_mount_rw -f /scratch
* || check_hw_ib 100

And then configure your slurmd application to use it:

$ juju config slurmd nhc-conf="$(cat custom-conf.nhc)"

Note that this appends your custom configuration to the charm-defined nhc.conf, without overwriting the pre-existing checks defined by the charm.

We also provide an action to display the currently used nhc.conf, which contains the base checks in addition to your custom ones:

$ juju run-action slurmd/leader show-nhc-config --wait

NHC options#

The default settings used in slurm.conf for NHC are as follows:

HealthCheckProgram=/usr/sbin/omni-nhc-wrapper
HealthCheckInterval=600
HealthCheckNodeState=ANY,CYCLE

These values imply that NHC runs every 600 seconds (10 minutes) on all compute nodes regardless of their state (even on allocated nodes), but not on all of them at the same time: CYCLE staggers the runs across nodes.

The /usr/sbin/omni-nhc-wrapper script allows you to supply custom arguments that change how Slurm invokes the health check scripts, via a slurmctld charm configuration. For example, to configure NHC to send an e-mail to admin@company.com with the subject header NHC errors when it detects an error, change the health-check-params configuration to:

$ juju config slurmctld health-check-params='-M admin@company.com -S "NHC errors"'

Please check the NHC documentation for configuration details.

It is possible to change the interval (in seconds) at which NHC runs, and the node states on which to perform the checks:

$ juju config slurmd health-check-interval=300
$ juju config slurmd health-check-state="CYCLE,ANY"

Note

NHC does not undrain a node. If a node has been drained and NHC runs on it, the node remains in the drained state, regardless of whether the checks pass or fail.

This ensures that if someone drained a node for troubleshooting, it will not be resumed before the administrator finishes their tasks.
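To return a drained node to service once troubleshooting is done, the administrator resumes it manually; for example (the node name here is illustrative):

```shell
# Mark the node as healthy again so Slurm can schedule jobs on it
$ scontrol update NodeName=compute-0 State=RESUME
```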

Please refer to the slurm.conf documentation for configuration details.