Over the past few months we’ve been referring to a new type of scheduler that we’ve been testing. We decided to move away from PBS as MAUI is no longer being maintained. The scheduler we have selected, SLURM, is maintained by SchedMD; it is free and open-source, actively developed with a clearly defined road map, and used by many other HPC centers. In fact, 60% of the Top500 clusters make use of SLURM, including Tianhe-2, currently the fastest supercomputer in the world.
Currently hal.uct.ac.za is running version 14.11, although 15.08 has recently been released. The hardware has been migrated from the old HPC cluster and consists of a single VM head node and several HP BL460 worker nodes. These are grouped into two partitions (SLURM’s equivalent of queues) depending on their processor architecture and memory. We are also simulating a set of GPU cards in one of the nodes, as SLURM has the ability to schedule GPU resources. Another nice feature is its ability to schedule memory as a consumable resource.
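To illustrate, GPU and memory requests are expressed as ordinary batch directives. This is a minimal sketch; the partition name and program are hypothetical, not actual resources on hal:

```shell
#!/bin/bash
# Hypothetical job requesting one GPU and 4 GB of memory.
# The partition name 'gpu' and the program are illustrative assumptions.
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1   # one GPU via SLURM's Generic RESource (GRES) mechanism
#SBATCH --mem=4G       # memory treated as a consumable resource
./my_gpu_program
```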
The manner of submitting jobs is very similar to PBS’s qsub; however, the commands differ, and some directives look similar to their PBS counterparts but behave differently. For instance, in PBS one would state
#PBS -l nodes=2:ppn=4
and the scheduler would grant you 8 cores symmetrically across two nodes. In SLURM one writes
#SBATCH --nodes=2 --ntasks=8
and the scheduler grants you 2 nodes with 8 cores in total, not 16. Additionally, the default core-filling strategy is non-symmetric: the scheduler packs tasks onto the first node, so it will attempt to grant 7 cores on the first node and 1 on the second, unless you specify symmetry with the following directives
--ntasks-per-socket or --distribution=cyclic:cyclic
The reason for this is that, depending on your task, your code may run faster with a certain arrangement of threads, and SLURM gives you the option of fine-tuning this to the nth degree. However, if you just want cores on nodes, you can ignore these more complex directives.
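Putting the directives above together, a symmetric request could be sketched as follows; the program name is a placeholder:

```shell
#!/bin/bash
# Sketch of a symmetric 2-node, 8-task request.
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --distribution=cyclic:cyclic   # round-robin tasks across nodes and sockets
# (--ntasks-per-socket=N is the alternative mentioned above)
srun ./my_program
```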
To submit a job one runs sbatch <shell script> instead of qsub <shell script>. In fact, migrations from PBS to SLURM are so common that other institutes have published translation pages for the two schedulers, and SchedMD provides several wrapper scripts that accept PBS-style commands. Attempting to run the qsub command without these wrappers installed will simply result in a ‘command not found’ error.
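In day-to-day use, the PBS commands map onto SLURM equivalents along these lines (the script name is illustrative):

```shell
# Submit a batch script; SLURM prints the assigned job ID.
sbatch myjob.sh

# List your queued and running jobs (replaces qstat).
squeue -u $USER

# Cancel a job by ID (replaces qdel).
scancel <jobid>
```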
Another important difference is the tight coupling of OpenMPI and SLURM. One no longer invokes the mpirun wrapper but rather SLURM’s own srun launcher. The assigned cores are read from the SLURM directives and do not need to be repeated for srun. One can even run srun directly from the head node command line as a job.
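A sketch of what this looks like for an MPI job; the program name is a placeholder:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
# srun reads the task count from the directives above,
# so no -np argument is needed as with mpirun.
srun ./my_mpi_program
```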
One other major change is that we’re requiring users to state up front what their wall time will be. This allows the scheduler to assign additional jobs more optimally via the back-fill algorithm, which makes the cluster more effective. It also ensures the correct working of the fair-share factor in the Multifactor Priority job priority algorithm.
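Declaring wall time is a single extra directive; the value here is purely illustrative:

```shell
#!/bin/bash
# Request 2 hours of wall time; the format is [days-]hours:minutes:seconds.
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
./my_program
```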
We are granting users access to this proof-of-concept cluster on a request basis for testing purposes, as its storage is small, slow and not suited for production work.
ICTS eResearch is currently investigating hardware options for its next cluster, to be housed at the data center at ICTS on Main. This will most likely consist of Intel-based HPC nodes combined with separate GPU nodes and an InfiniBand interconnect. There may also be a substantial portion of the cluster virtualized in the cloud.