Installation overview#

The OSD can be installed on a cloud of your choosing. The only component common to all deployments is Juju itself: you must have the Juju client installed on your local system to administer the installation and deployment of the Slurm charms.

Install Juju#

$ sudo snap install juju --classic

Once the Juju client is installed, you will be ready to proceed with deploying Slurm to a cloud of your choosing.
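
You can confirm the client is installed and on your PATH by checking its version:

$ juju version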

Setup cloud#

Follow the documentation for the cloud you wish to deploy Slurm on:

AWS#

Login to a Juju controller#

In any deployment scenario we will need to be logged into a Juju controller. For this example we will log in to the public Juju controller, JAAS.

$ juju login jaas
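
If your AWS credentials are not yet known to the controller, you can register them interactively first (a minimal sketch; Juju will prompt for your access keys and a credential name of your choosing):

$ juju add-credential aws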

Add a model#

Once you are logged into a Juju controller you need to add a model. Run the following command to add the model that will house the OSD.

$ juju add-model slurm aws/us-west-2

LXD#

Bootstrap a localhost LXD Juju controller#

In any deployment scenario we will need to be logged into a Juju controller. For this example we will bootstrap a Juju controller in an LXD container on our local machine.

Install and configure LXD, if you haven't already:

$ sudo snap install lxd
$ lxd init --auto
$ lxc network set lxdbr0 ipv6.address none

Note

Juju does not support IPv6; the last command disables it.

You can now bootstrap your local cloud:

$ juju bootstrap localhost

Following a successful bootstrap, running juju controllers will show your newly provisioned LXD controller.
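
You can list the controllers known to your client with:

$ juju controllers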

CentOS7 Deploys on LXD clouds#

You need to manually create an LXD image for CentOS7 in order to deploy it with Juju.

The first step is to download a configuration file describing the image:

$ wget https://raw.githubusercontent.com/lxc/lxc-ci/master/images/centos.yaml

Juju needs two extra packages (sudo and openssh-server) that are not in the base image. You need to add them manually to the packages section of the YAML file. The first set of packages in the file should then be:

packages:
  manager: yum
  update: true
  cleanup: true
  sets:
  - packages:
    - cronie
    - cronie-noanacron
    - curl
    - dhclient
    - initscripts
    - openssh-clients
    - passwd
    - policycoreutils
    - rootfiles
    - rsyslog
    - vim-minimal
    - sudo
    - openssh-server
    action: install

Now we need to install distrobuilder and generate the image:

$ sudo snap install distrobuilder --classic
$ sudo distrobuilder build-lxd centos.yaml -o image.architecture=x86_64 -o image.release=7 -o image.variant=cloud

To make this new image available to Juju, we need to import it with an alias:

$ lxc image import lxd.tar.xz rootfs.squashfs --alias juju/centos7/amd64

You can check that the image was correctly imported into LXD with lxc image list. To test that it works with Juju, you can run juju add-machine --series centos7.
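
For example:

$ lxc image list
$ juju add-machine --series centos7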

Add a model#

Once you have created your Juju controller you need to add a model. Run the following command to add the model that will house the OSD.

$ juju add-model slurm

MAAS#

Login to a Juju controller#

If you don't already have a Juju controller, bootstrap one; this provisions a controller machine on your cloud.

$ juju bootstrap
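
If Juju knows about more than one cloud, you can name the MAAS cloud explicitly; a minimal sketch, assuming your MAAS cloud is registered as maas-cloud (substitute the name shown by juju clouds):

$ juju clouds
$ juju bootstrap maas-cloud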

Add a model#

Once you are logged into a Juju controller you need to add a model. Run the following command to add the model that will house the OSD.

$ juju add-model slurm

Deploy Slurm#

Now it is time to get Slurm :)

The slurm-bundles repository provides a bundle and overlays that simplify deploying all the components needed for a complete Slurm cluster. First, clone the repository and enter it:

$ git clone https://github.com/omnivector-solutions/slurm-bundles
$ cd slurm-bundles

The slurm-core directory contains all the bundles and overlays to deploy a basic Slurm cluster:

  • slurm-core/bundle.yaml: the basic definition of the Slurm components.

  • slurm-core/clouds/: overlays with specific settings for each supported cloud environment. Currently available ones are AWS and LXD.

  • slurm-core/series/: overlays to define the OS of the Slurm components, the possible options are CentOS7 and Ubuntu Focal (20.04).

  • slurm-core/charms/: overlays to change the source of the charms. By default, the bundle gets them from the latest/stable channel on Charmhub. The overlays in this directory allow us to change the source to the latest/edge channel or to charms on your local machine. The latter is especially useful for development; see Charm development for details.

The slurm-addons directory contains overlays to extend Slurm with plugins:

  • slurm-addons/influxdb.yaml: overlay to relate slurmctld to InfluxDB, to collect profiling information about the jobs. See InfluxDB profiling plugin for details on usage.

  • slurm-addons/monitoring.yaml: overlay to deploy prometheus2, prometheus-node-exporter and slurm-exporter for cluster monitoring. See Monitoring for details on usage.

  • slurm-addons/elasticsearch-acct.yaml: overlay to deploy elasticsearch and relate it to slurmctld to collect accounting information about the jobs. See ElasticSearch plugin for details.

  • slurm-addons/fluentbit.yaml: overlay to deploy fluentbit and relate it to the slurm-charms to forward logs to a centralized system. See Logging for details.

For example, to deploy Slurm to a local LXD cloud, on Ubuntu Focal, using the latest/stable charms:

$ juju deploy ./slurm-core/bundle.yaml \
              --overlay ./slurm-core/clouds/lxd.yaml \
              --overlay ./slurm-core/series/focal.yaml

Juju will then download the charms from Charmhub and create the applications, configurations, and LXD containers described in the respective files; together these comprise the model.
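
To include one of the add-ons, stack its overlay on top of the same command; for example, a sketch adding the monitoring overlay to the LXD/Focal deployment above:

$ juju deploy ./slurm-core/bundle.yaml \
              --overlay ./slurm-core/clouds/lxd.yaml \
              --overlay ./slurm-core/series/focal.yaml \
              --overlay ./slurm-addons/monitoring.yaml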

Note

The Slurm charms install Slurm from Omnivector's OSD PPA on Ubuntu. It is possible to change the source to Omnivector's Testing PPA (or to a local cache server) with the custom-slurm-repo configuration option. Setting this value to repositories other than Omnivector's PPAs is not supported and might result in a broken system.
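
If you do need to switch to the testing PPA, the option is set with juju config; a sketch only, assuming the option is exposed on the slurmd application and using a placeholder repository value:

$ juju config slurmd custom-slurm-repo="<repository>"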

For a more detailed guide to Slurm installation, see Slurm Installation.

It will take a moment to get everything ready. You can check the status of your model with juju status:

$ watch -n 1 -c juju status --color

Model    Controller  Cloud/Region         Version  SLA          Timestamp
default  overlord    localhost/localhost  2.8.7    unsupported  17:44:29Z

App              Version  Status  Scale  Charm            Store       Channel  Rev  OS      Message
percona-cluster  5.7.20   active      1  percona-cluster  charmstore  stable   293  ubuntu  Unit is ready
slurmctld        0.6.4    active      1  slurmctld        charmhub    stable     7  ubuntu  slurmctld available
slurmd           0.6.4    active      1  slurmd           charmhub    stable    13  ubuntu  slurmd available
slurmdbd         0.6.4    active      1  slurmdbd         charmhub    stable     5  ubuntu  slurmdbd ready
slurmrestd       0.6.4    active      1  slurmrestd       charmhub    stable     5  ubuntu  slurmrestd available

Unit                   Workload  Agent  Machine  Public address  Ports     Message
percona-cluster/0*     active    idle   0        10.34.166.18    3306/tcp  Unit is ready
slurmctld/0*           active    idle   2        10.34.166.222             slurmctld available
slurmd/0*              active    idle   3        10.34.166.219             slurmd available
slurmdbd/0*            active    idle   4        10.34.166.218             slurmdbd available
slurmrestd/0*          active    idle   5        10.34.166.66              slurmrestd available

Machine  State    DNS            Inst id        Series  AZ  Message
0        started  10.34.166.18   juju-01ab62-0  bionic      Running
2        started  10.34.166.222  juju-01ab62-2  focal       Running
3        started  10.34.166.219  juju-01ab62-3  focal       Running
4        started  10.34.166.218  juju-01ab62-4  focal       Running
5        started  10.34.166.66   juju-01ab62-5  focal       Running

Once the workload status is active and the agent status is idle, the Slurm cluster is ready for use.

You can see the status of your cluster by running the sinfo command:

$ juju run --unit slurmctld/0 sinfo
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
osd-slurmd           up   infinite      1   down juju-01ab62-3

The nodes start in the down state with Reason = New node, so when you add more nodes to the cluster they will not execute jobs from the queue. This makes it possible to do some post-installation configuration before setting the nodes to idle. You can double-check that your nodes are down for this reason, and not some other one, with sinfo -R:

$ juju run --unit slurmctld/0 "sinfo -R"
REASON               USER      TIMESTAMP           NODELIST
New node             root      2021-03-09T20:24:09 juju-01ab62-3

After completing the node's setup, bring it into service by running a Juju action:

$ juju run-action slurmd/0 node-configured
$ juju run --unit slurmctld/0 sinfo
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
osd-slurmd           up   infinite      1   idle juju-01ab62-3

Please refer to our Operations section for detailed instructions on how to manage the cluster.