Infiniband

The slurmd charm contains support for Infiniband driver lifecycle operations.

Using the slurmd charm actions we can configure a repository and install Infiniband drivers from it.

The general workflow for installing Infiniband drivers from a custom repository includes setting up the driver repository, installing the drivers, and rebooting the node (in that order).

By default, OSD uses the Mellanox repository for OFED 5.4. To use a different version or a different repository, create a text file describing the repository and configure the charms to use it. If no custom repository is set, the charms fall back to the default Mellanox repository. For example, to use Mellanox OFED 4.9, download the repository file that matches your operating system and pass its contents to the charm action.

Note

The value passed to the charm action should be base64 encoded.
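
As a minimal sketch of that encoding step, the following shell snippet encodes a repository file and checks that it decodes back unchanged. The helper name encode_repo and the placeholder file contents are illustrative assumptions, not part of the charm; only the base64 requirement comes from the note above.

```shell
#!/bin/sh
# Sketch: base64-encode a repository file for use as a charm action parameter.
# encode_repo is a hypothetical helper name, not something the charm provides.
set -eu

encode_repo() {
    # Strip newlines so the encoded value is a single line, regardless of
    # how the local base64 implementation wraps its output.
    base64 < "$1" | tr -d '\n'
}

# Placeholder repository file; replace it with the real .repo file.
repo_file=$(mktemp)
printf '[example]\nname=Example repository\nenabled=1\n' > "$repo_file"

repo=$(encode_repo "$repo_file")

# Round-trip check: decoding must reproduce the original file contents.
if [ "$(printf '%s' "$repo" | base64 -d)" = "$(cat "$repo_file")" ]; then
    echo "encoding OK"
else
    echo "encoding mismatch" >&2
    exit 1
fi
rm -f "$repo_file"
```

The resulting value of repo is what would be passed as the repo parameter of the action.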

It is possible to query the currently used repository, as a way to check the configuration:

$ juju run-action slurmd/leader get-infiniband-repo --wait

For example, to download the repository file for OFED 4.9 on CentOS 7 and install the drivers in a loop over all 150 compute nodes:

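# Save the repository file locally under the name "repository"
curl -o repository \
     https://linux.mellanox.com/public/repo/mlnx_ofed/4.9-2.2.4.0/rhel7.9/mellanox_mlnx_ofed.repo

# The action expects the file contents base64 encoded
repo=$(base64 < repository)
# Units compute/0 through compute/149 (150 nodes)
for i in {0..149}; do
     juju run-action compute/$i set-infiniband-repo repo="$repo" --wait
     juju run-action compute/$i install-infiniband
done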
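
The loop above does not report which units had problems. A slightly more defensive variant might look like the sketch below. The function name install_on_units and the JUJU variable (a dry-run hook) are hypothetical; the juju subcommands and action names are the ones already used above. Only the exit status of the juju command itself is checked, and whether a failed action produces a non-zero exit status is an assumption that should be verified for your Juju version.

```shell
#!/bin/sh
# Sketch: set the repository and queue the driver install on several units,
# collecting the units for which a juju command returned an error.
# JUJU and install_on_units are illustrative names, not part of the charm.
set -u

JUJU=${JUJU:-juju}   # set JUJU=echo to preview the commands without running them

install_on_units() {
    repo=$1; shift   # first argument: base64-encoded repository contents
    failed=""
    for unit in "$@"; do
        if "$JUJU" run-action "$unit" set-infiniband-repo repo="$repo" --wait &&
           "$JUJU" run-action "$unit" install-infiniband; then
            echo "queued install on $unit"
        else
            failed="$failed $unit"
        fi
    done
    # Report failures on stderr and signal them through the return status.
    [ -z "$failed" ] || { echo "failed units:$failed" >&2; return 1; }
}
```

For 150 nodes, the unit list could be generated with seq, for example install_on_units "$repo" $(seq -f 'compute/%g' 0 149).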

The charm installs a package named mlnx-ofed-all. Note that this procedure takes some time. After the drivers are installed, you need to reboot the nodes. For example, to reboot all 150 nodes:

for i in {0..149}; do
     juju ssh compute/$i sudo reboot
done

After the node reboots, the Infiniband service should be enabled and active. To query its state, use the is-active-infiniband action for the compute node:

$ juju run-action compute/42 is-active-infiniband --wait
unit-compute-42:
  UnitId: compute/42
  id: "899"
  results:
    infiniband-is-active: "True"
  status: completed
  timing:
    completed: 2021-12-17 16:32:40 +0000 UTC
    enqueued: 2021-12-17 16:32:38 +0000 UTC
    started: 2021-12-17 16:32:39 +0000 UTC
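
When checking many nodes, it can help to reduce this output to a yes/no answer. The sketch below reads the action output on standard input and looks only at the infiniband-is-active field. The helper name ib_is_active is hypothetical, and the sample text is an excerpt of the output shown above; any value other than "True" is treated as not active, which is an assumption about how the action reports an inactive service.

```shell
#!/bin/sh
# Sketch: decide from the action output whether Infiniband is active.
# ib_is_active is a hypothetical helper that reads the output on stdin.
ib_is_active() {
    # Match the infiniband-is-active field with the value True,
    # with or without surrounding double quotes.
    grep -Eq '^[[:space:]]*infiniband-is-active:[[:space:]]*"?True"?[[:space:]]*$'
}

# Excerpt of the action output shown above, used here as sample input.
sample='  results:
    infiniband-is-active: "True"
  status: completed'

if printf '%s\n' "$sample" | ib_is_active; then
    echo "active"
else
    echo "not active"
fi
```

In practice the sample would be replaced by the real output, for example by piping juju run-action compute/42 is-active-infiniband --wait into ib_is_active.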

It is also possible to run the ibstat utility through juju ssh to query the Infiniband capabilities and double-check that the link is up:

$ juju ssh compute/42 /usr/sbin/ibstat
CA 'mlx5_0'
     CA type: MT4115
     Number of ports: 1
     Firmware version: 12.25.1020
     Hardware version: 0
     Node GUID: 0x506b4b0fabede600
     System image GUID: 0x
     Port 1:
             State: Active
             Physical state: LinkUp
             Rate: 100
             Base lid: 43
             LMC: 0
             SM lid: 3
             Capability mask: 0x2651e000
             Port GUID: 0x506b4b0fabede600
             Link layer: InfiniBand
Connection to 10.14.192.42 closed.
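
The same check can be scripted. The following sketch reads ibstat output on standard input and succeeds only if it reports both State: Active and Physical state: LinkUp. The helper name ib_link_up is hypothetical, and the logic assumes single-port output like the example above; with several ports it would not tell them apart.

```shell
#!/bin/sh
# Sketch: check ibstat output for an active port with the physical link up.
# ib_link_up is a hypothetical helper that reads ibstat output on stdin.
ib_link_up() {
    awk '
        /^[[:space:]]*State:[[:space:]]*Active/          { state = 1 }
        /^[[:space:]]*Physical state:[[:space:]]*LinkUp/ { phys = 1 }
        END { exit !(state && phys) }   # exit 0 only if both lines were seen
    '
}

# Excerpt of the ibstat output shown above, used here as sample input.
sample="CA 'mlx5_0'
     Port 1:
             State: Active
             Physical state: LinkUp
             Rate: 100"

if printf '%s\n' "$sample" | ib_link_up; then
    echo "link up"
else
    echo "link down"
fi
```

Against a real node, the input could come from juju ssh compute/42 /usr/sbin/ibstat piped into ib_link_up.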