.. _installation:

=====================
Installation overview
=====================

The OSD can be installed on a cloud of your choosing. The only common component used across
deployments is juju_ itself. You must have the Juju client installed on your local system in
order to administer the installation and deployment of the Slurm charms.

.. _juju: https://juju.is

Install Juju
============

.. code-block:: bash

    $ sudo snap install juju --classic

Once the Juju client is installed you will be ready to proceed with deploying Slurm to a
cloud of your choosing.

Setup cloud
===========

Follow the documentation for the cloud you wish to deploy Slurm on:

AWS
###

Login to a Juju controller
--------------------------

In any deployment scenario we will need to be logged into a Juju controller. For this example
we will log in to the public Juju controller, JAAS.

.. code-block:: bash

    $ juju login jaas

Add a model
-----------

Once you are logged into a Juju controller you need to add a model. Run the following command
to add the model that will house the OSD.

.. code-block:: bash

    $ juju add-model slurm aws/us-west-2

LXD
###

Bootstrap a localhost LXD Juju controller
-----------------------------------------

In any deployment scenario we will need to be logged into a Juju controller. For this example
we will bootstrap a Juju controller in an LXD container on our local machine.

Install and configure LXD, if you haven't already:

.. code-block:: bash

    $ sudo snap install lxd
    $ lxd init --auto
    $ lxc network set lxdbr0 ipv6.address none

.. note:: Juju does not support IPv6, so the last command disables it.

You can now bootstrap your local cloud:

.. code-block:: bash

    $ juju bootstrap localhost

Following a successful bootstrap, ``juju controllers`` will show your newly provisioned LXD
controller.

.. _centos7-image:

CentOS7 Deploys on LXD clouds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You need to manually create an LXD image for CentOS7 in order to deploy it with Juju. The
first step is to download a configuration file describing the image:

.. code-block:: bash

    $ wget https://raw.githubusercontent.com/lxc/lxc-ci/master/images/centos.yaml

Juju needs two extra packages (``sudo`` and ``openssh-server``) that are not in the base
image. You need to manually add them to the ``packages`` section of the yaml file. The first
``set`` of packages in the file should then be:

.. code-block:: yaml

    packages:
      manager: yum
      update: true
      cleanup: true

      sets:
        - packages:
            - cronie
            - cronie-noanacron
            - curl
            - dhclient
            - initscripts
            - openssh-clients
            - passwd
            - policycoreutils
            - rootfiles
            - rsyslog
            - vim-minimal
            - sudo
            - openssh-server
          action: install

Now we need to install ``distrobuilder`` and generate the image:

.. code-block:: bash

    $ sudo snap install distrobuilder --classic
    $ sudo distrobuilder build-lxd centos.yaml -o image.architecture=x86_64 -o image.release=7 -o image.variant=cloud

To make this new image available to Juju, we need to import it with an alias:

.. code-block:: bash

    $ lxc image import lxd.tar.xz rootfs.squashfs --alias juju/centos7/amd64

You can check that the image was correctly imported to LXD with ``lxc image list``. To test
that it works with Juju, you can ``juju add-machine --series centos7``.
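Put together, that verification looks roughly like this. This is a minimal sketch; the
machine id in the cleanup step is a placeholder for whatever id ``juju status`` reports for
the test machine:

.. code-block:: bash

    # The alias given at import time should appear in the image list
    $ lxc image list

    # Provision a test machine from the image and wait for it to reach the "started" state
    $ juju add-machine --series centos7
    $ juju status

    # Clean up the test machine afterwards, using the id shown by "juju status"
    $ juju remove-machine <machine-id>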
Add a model
-----------

Once you have created your Juju controller you need to add a model. Run the following command
to add the model that will house the OSD.

.. code-block:: bash

    $ juju add-model slurm

MAAS
####

Login to a Juju controller
--------------------------

If you don't already have a Juju controller, bootstrap Juju by creating a Juju controller
machine.

.. code-block:: bash

    $ juju bootstrap

Add a model
-----------

Once you are logged into a Juju controller you need to add a model. Run the following command
to add the model that will house the OSD.

.. code-block:: bash

    $ juju add-model slurm

Deploy Slurm
============

Now it is time to get Slurm :)

We provide a bundle and overlays in the `slurm-bundles
<https://github.com/omnivector-solutions/slurm-bundles>`_ repository to simplify deploying
all the components needed for a complete Slurm cluster.

First, clone the repository and enter it:

.. code-block:: bash

    $ git clone https://github.com/omnivector-solutions/slurm-bundles
    $ cd slurm-bundles

The ``slurm-core`` directory contains all the bundles and overlays needed to deploy a basic
Slurm cluster:

- ``slurm-core/bundle.yaml``: the basic definition of the Slurm components.
- ``slurm-core/clouds/``: overlays with specific settings for each supported cloud
  environment. The currently available ones are AWS and LXD.
- ``slurm-core/series/``: overlays to define the OS of the Slurm components; the possible
  options are CentOS7 and Ubuntu Focal (20.04).
- ``slurm-core/charms/``: overlays to change the source of the charms. By default, the bundle
  gets them from the ``latest/stable`` channel on Charmhub. The overlays in this directory
  allow you to change the source to either the ``latest/edge`` channel or your local machine.
  The latter is especially useful for development; see :ref:`charm-development` for details.

The ``slurm-addons`` directory contains overlays to extend Slurm with plugins. An example of
stacking one of them onto the deploy command is shown after the deployment example below.

- ``slurm-addons/influxdb.yaml``: overlay to relate ``slurmctld`` to InfluxDB, to collect
  profiling information about the jobs. See :ref:`influxdb-profiling` for details on usage.
- ``slurm-addons/monitoring.yaml``: overlay to deploy ``prometheus2``,
  ``prometheus-node-exporter`` and ``slurm-exporter`` for cluster monitoring. See
  :ref:`monitoring` for details on usage.
- ``slurm-addons/elasticsearch-acct.yaml``: overlay to deploy ``elasticsearch`` and relate it
  to ``slurmctld`` to collect accounting information about the jobs. See
  :ref:`elasticsearch-accounting` for details.
- ``slurm-addons/fluentbit.yaml``: overlay to deploy ``fluentbit`` and relate it to the
  ``slurm-charms`` to forward logs to a centralized system. See :ref:`logging` for details.

For example, to deploy Slurm to a local LXD cloud, on Ubuntu Focal, using the
``latest/stable`` charms:

.. code-block:: bash

    $ juju deploy ./slurm-core/bundle.yaml \
                  --overlay ./slurm-core/clouds/lxd.yaml \
                  --overlay ./slurm-core/series/focal.yaml

Juju will then download the charms from Charmhub and create the applications, configurations,
and LXD containers described in the respective files, which together make up the model.
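The add-on overlays in ``slurm-addons`` can be stacked onto the same command with additional
``--overlay`` flags. As a sketch, a deployment that also wires in the Fluentbit log
forwarding described above would look roughly like this (see :ref:`logging` for the
authoritative steps):

.. code-block:: bash

    $ juju deploy ./slurm-core/bundle.yaml \
                  --overlay ./slurm-core/clouds/lxd.yaml \
                  --overlay ./slurm-core/series/focal.yaml \
                  --overlay ./slurm-addons/fluentbit.yaml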
.. note:: The Slurm charms install Slurm from Omnivector's OSD PPA on Ubuntu. It is possible
   to change the source to Omnivector's Testing PPA (or to a local cache server) with the
   ``custom-slurm-repo`` configuration option. Setting this value to repositories other than
   Omnivector's PPAs is not supported and might result in a broken system.

For a more detailed guide to Slurm installation, see :ref:`installation-operation`.

It will take a moment to get everything ready. You can check the status of your model with
``juju status``:

.. code-block:: bash

    $ watch -n 1 -c juju status --color

    Model    Controller  Cloud/Region         Version  SLA          Timestamp
    default  overlord    localhost/localhost  2.8.7    unsupported  17:44:29Z

    App              Version  Status  Scale  Charm            Store       Channel  Rev  OS      Message
    percona-cluster  5.7.20   active      1  percona-cluster  charmstore  stable   293  ubuntu  Unit is ready
    slurmctld        0.6.4    active      1  slurmctld        charmhub    stable     7  ubuntu  slurmctld available
    slurmd           0.6.4    active      1  slurmd           charmhub    stable    13  ubuntu  slurmd available
    slurmdbd         0.6.4    active      1  slurmdbd         charmhub    stable     5  ubuntu  slurmdbd ready
    slurmrestd       0.6.4    active      1  slurmrestd       charmhub    stable     5  ubuntu  slurmrestd available

    Unit                Workload  Agent  Machine  Public address  Ports     Message
    percona-cluster/0*  active    idle   0        10.34.166.18    3306/tcp  Unit is ready
    slurmctld/0*        active    idle   2        10.34.166.222             slurmctld available
    slurmd/0*           active    idle   3        10.34.166.219             slurmd available
    slurmdbd/0*         active    idle   4        10.34.166.218             slurmdbd available
    slurmrestd/0*       active    idle   5        10.34.166.66              slurmrestd available

    Machine  State    DNS            Inst id        Series  AZ  Message
    0        started  10.34.166.18   juju-01ab62-0  bionic      Running
    2        started  10.34.166.222  juju-01ab62-2  focal       Running
    3        started  10.34.166.219  juju-01ab62-3  focal       Running
    4        started  10.34.166.218  juju-01ab62-4  focal       Running
    5        started  10.34.166.66   juju-01ab62-5  focal       Running

Once the workload status is *active* and the agent status is *idle*, the Slurm cluster is
ready for use. You can see the status of your cluster by running the ``sinfo`` command:

.. code-block:: bash

    $ juju run --unit slurmctld/0 sinfo

    PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    osd-slurmd     up   infinite      1   down  juju-01ab62-3

The nodes start in the *down* state with ``Reason = New node``, so nodes newly added to the
cluster will not execute jobs from the queue. This makes it possible to do some
post-installation configuration before setting the nodes to *idle*. You can double-check that
your nodes are down for this reason and not some other one with ``sinfo -R``:

.. code-block:: bash

    $ juju run --unit slurmctld/0 "sinfo -R"

    REASON     USER  TIMESTAMP            NODELIST
    New node   root  2021-03-09T20:24:09  ip-172-31-83-4

Once the node is set up, bring it into service by running a Juju *action*:

.. code-block:: bash

    $ juju run-action slurmd/0 node-configured
    $ juju run --unit slurmctld/0 sinfo

    PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
    osd-slurmd     up   infinite      1   idle  juju-01ab62-3

Please refer to our :ref:`operations` section for detailed instructions on how to manage the
cluster.
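Optionally, you can finish with a quick smoke test on the newly idle node. This is a minimal
sketch; it assumes ``srun`` is available on the ``slurmd`` unit and that root is allowed to
submit jobs (the unit and node names depend on your deployment):

.. code-block:: bash

    # Run a one-task job on the cluster; it should print the compute node's hostname
    $ juju run --unit slurmd/0 "srun -N1 hostname"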