HAL - UCT HPC

Submitting jobs to HAL
The ICTS High Performance Cluster uses SLURM to schedule jobs. There is one head node (hal.uct.ac.za) that users connect to in order to submit jobs. The /home partition on the head node is NFS mounted (i.e. common) to all worker nodes, regardless of series. Resources are assigned to partitions which can be thought of as queues.

Partition	Description	Nodes	Cores / node	Max cores / user	Time limit	Priority
ada	Standard partition	100-126	40	120	170 hours	20
swan	Large core partition	119-122	40	160	24 hours	30
curie	Long term partition	600-609	64	64	750 hours	20
gpuo	GPU partition	001-004	16/12/12/20	32	150 hours	20
gpumk	Private	005-008	32	Private	Private	Private
gpumka	Private	005-008	32	Private	Private	Private
a100	GPU partition	009-010	56	varies	varies	varies
grace	High memory	801-802	24	24	72 hours	20
sadacc-short	Private	127-134	44	176	1 hour	20
sadacc-long	Private	127-134	44	176	24 hours	30

Users are assigned to an account, which is analogous to a group and is normally their department or research group, for instance maths or compsci. This account is subordinate to a faculty account such as Health or Science. A user may also be assigned to additional accounts. Resources such as MaxCPUs and CPUHours are assigned to accounts, and a setting on a lower-level account may override the setting on a higher account. Aggregate settings may also limit the total resources that all members of lower-level accounts may request. For example, a user in the maths department may have access to 100 cores even though the MaxCPUs for the Science faculty is 75 CPUs. However, the aggregate Science faculty GrpCPUs limit may be 1000, which could still limit this user if many Science faculty users are running jobs.

Additionally, this user may be part of a research group called cosmo that has contributed resources to the cluster. The user is then added to the cosmo account, which gives access to 200 cores as long as the user specifies the cosmo account in the job submission. Accounts may also be limited to specific partitions: the user may submit to the maths account on partition uctlong, but may only submit to the cosmo account on partition cosmology.
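The way a per-account limit and an aggregate limit combine can be sketched as simple arithmetic. The function below is a hypothetical helper, not a SLURM command. The 100-core and 1000-core limits come from the example above, while the usage figures (400 and 950 cores in use) are assumed values chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: how many cores a user could obtain right now.
#   $1 = cores allowed by the user's own account (e.g. maths)
#   $2 = aggregate GrpCPUs limit of the parent account (e.g. Science)
#   $3 = cores currently in use by all members of the parent account (assumed)
available_cores() {
    account_limit=$1
    group_limit=$2
    group_in_use=$3
    group_free=$((group_limit - group_in_use))
    if [ "$group_free" -lt 0 ]; then
        group_free=0
    fi
    # The smaller of the two figures is the effective limit.
    if [ "$account_limit" -lt "$group_free" ]; then
        echo "$account_limit"
    else
        echo "$group_free"
    fi
}

available_cores 100 1000 400   # faculty is quiet: the account limit of 100 applies
available_cores 100 1000 950   # faculty is busy: only 50 cores remain
```

The actual limits that apply to you are set by the cluster administrators; this sketch only shows why a user with a 100-core allowance can still be held back by the faculty total.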

Time format in SLURM:

Before starting, it is important to understand the format of the time parameter to avoid ambiguity and confusion. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”. This option applies to job and step allocations. Jobs on the HAL cluster won’t run unless a wall time is explicitly specified. We require you to enter a wall time for each job rather than rely on a default because we want you to think carefully about the wall time limit for your jobs. This is your responsibility.

  Some examples:
    50 = 50 minutes
    50:00 = 50 minutes
    50:00:00 = 50 hours
    2-2 = 50 hours (2 days and 2 hours)
    2-2:00 = 50 hours (2 days and 2 hours)
    2-2:00:00 = 50 hours (2 days and 2 hours)
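To check how a time string will be read before submitting, the rules above can be written down as a small conversion function. This is an illustrative helper, not part of SLURM; the name slurm_time_to_seconds is made up for this sketch, and it assumes the input is already in one of the six accepted formats.

```shell
#!/bin/sh
# Illustrative helper (not a SLURM tool): convert a SLURM time string to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; hours = 0; minutes = 0; seconds = 0
        rest = $0
        has_days = (index($0, "-") > 0)
        if (has_days) {
            # days-hours[:minutes[:seconds]]
            split($0, parts, "-")
            days = parts[1]
            rest = parts[2]
        }
        n = split(rest, f, ":")
        if (has_days) {
            hours = f[1]
            if (n >= 2) minutes = f[2]
            if (n >= 3) seconds = f[3]
        } else if (n == 1) {
            minutes = f[1]                      # minutes
        } else if (n == 2) {
            minutes = f[1]; seconds = f[2]      # minutes:seconds
        } else {
            hours = f[1]; minutes = f[2]; seconds = f[3]   # hours:minutes:seconds
        }
        total = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
        print total
    }'
}

# Convert each of the example strings listed above.
for spec in 50 50:00 50:00:00 2-2 2-2:00 2-2:00:00; do
    echo "$spec = $(slurm_time_to_seconds "$spec") seconds"
done
```

The first two strings come out as 3000 seconds (50 minutes) and the remaining four as 180000 seconds (50 hours), matching the list above.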

Basic jobs:

Create a shell script with parameters similar to the one below:

  #!/bin/sh
  #SBATCH --account=maths
  #SBATCH --partition=uctlong
  #SBATCH --time=1000:00
  #SBATCH --nodes=2 --ntasks=3
  #SBATCH --mem-per-cpu=2000
  #SBATCH --job-name="My Maths Job"
  #SBATCH --mail-user=MyEmail@uct.ac.za
  #SBATCH --mail-type=ALL

  /opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile.txt

You can then submit the job by typing: sbatch myscript.sh

The --nodes directive is the number of worker nodes (servers) required, and --ntasks is the total number of cores for the job. If you specify more than one node then at least one task will be assigned to each additional server. You do not need to cd to the directory from which the job is launched. If you wish to run more than one job at a time in the same folder, you must ensure that each job’s output is directed to a different file, otherwise data files will conflict with or overwrite one another. In the above example the second job’s output should be directed to myfile2.txt.
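One way to keep output files apart without renaming them by hand is to build the file name from the job ID. The sketch below assumes the job runs under SLURM, which sets the SLURM_JOB_ID environment variable inside a running job. The function name unique_output_file and the "manual" fallback are choices made for this example only.

```shell
#!/bin/sh
# Build an output file name that is unique per job.
# SLURM sets SLURM_JOB_ID inside a running job; "manual" is a fallback
# chosen here for runs outside the scheduler.
unique_output_file() {
    base_dir=$1
    echo "${base_dir}/myfile_${SLURM_JOB_ID:-manual}.txt"
}

out_file=$(unique_output_file /home/fred/testA15)
echo "Output will be written to: $out_file"
# In the job script the program would then be started as:
#   /opt/exp_soft/softwareX/xyz -o "$out_file"
```

Two jobs submitted from the same script then write to different files, because each receives its own job ID.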

While the job runs on the worker node, standard output and standard error (the screen output you’d see if you ran on a desktop) are written to a .out file. If the screen output of your software fills up the disk, your job will fail. It is best to direct your job output to a file in /home or /scratch, either with a command line argument or with the Linux redirect operator (>). It is also recommended that you disable all unnecessary program output to minimize disk space usage, particularly for long job runs.
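The redirection advice can be sketched as follows. The --output and --error directives and the %j pattern (which expands to the job ID) are standard sbatch options; the /home/fred/logs path and the helper names run_logged and run_quiet are hypothetical and only serve this example.

```shell
#!/bin/sh
# Send the scheduler's copy of stdout and stderr to named files.
# %j is replaced by the job ID. The logs directory is a hypothetical path.
#SBATCH --output=/home/fred/logs/job_%j.out
#SBATCH --error=/home/fred/logs/job_%j.err

# Run a command with stdout and stderr collected in one log file.
run_logged() {
    log_file=$1
    shift
    "$@" > "$log_file" 2>&1
}

# Run a command and discard screen output that is not needed.
run_quiet() {
    "$@" > /dev/null 2>&1
}

# Usage inside the job script, e.g.:
#   run_logged /home/fred/testA15/run.log /opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile.txt
```

Discarding output with run_quiet is only appropriate when the program's screen output carries nothing you will need later, such as progress bars.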

Memory control:

Like CPU cores, memory is a limited resource. The --mem-per-cpu directive allows you to specify how much RAM is needed per core, in megabytes by default. The scheduler will attempt to increase the --ntasks value (the number of cores) in accordance with this directive where possible to try to maintain the RAM/core ratio.
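Because the memory request is made per core, the total for a job is the product of the two directives. A minimal sketch of that arithmetic, assuming one CPU per task (the SLURM default) and using a made-up function name:

```shell
#!/bin/sh
# Total memory a job requests, in megabytes: tasks x memory per CPU.
# Assumes one CPU per task, which is the SLURM default.
total_mem_mb() {
    ntasks=$1
    mem_per_cpu_mb=$2
    echo $((ntasks * mem_per_cpu_mb))
}

total_mem_mb 3 2000   # --ntasks=3 with --mem-per-cpu=2000: 6000 MB in total
```

Comparing this figure with the memory available on the nodes of the chosen partition is a quick check that the request can be satisfied at all.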

Parallel jobs:

Parallel jobs write to one file system regardless of which worker node they start on. This means that each job a user submits must start in a unique folder if the software that the job runs cannot write to uniquely named data files.

As an example, user fred has a home directory /home/fred/ on the head node, and this directory is also mounted on each worker node. This means that if fred creates /home/fred/myfile.txt on the head node, the file is immediately present on each worker node as well. Fred now submits a job. The job initially lands on node 602. OpenMPI then starts parallel versions of this job on, for example, nodes 601 and 603. Each of the three nodes writes data to /home/fred/myfile.txt.

If fred now submits another job and the software that fred is using cannot distinguish between concurrently running versions, then data written to /home/fred/myfile.txt will be intermingled and/or corrupted. Hence it is critical that software which is not capable of running concurrently be launched from unique directories. If fred wants to run 3 concurrent jobs then the following need to be created: /home/fred/job1, /home/fred/job2 and /home/fred/job3. The shell script that controls each job must contain a change directory (cd) command in order to select the correct directory.
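Setting up the per-job directories can be scripted. The helper below is a sketch with a made-up name (make_job_dirs); it only creates the directories and leaves submission to you.

```shell
#!/bin/sh
# Create one working directory per concurrent job (job1, job2, ...).
make_job_dirs() {
    base_dir=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        mkdir -p "${base_dir}/job${i}"
        i=$((i + 1))
    done
}

# Usage, e.g. for fred's three jobs:
#   make_job_dirs /home/fred 3
# Each job script then begins by selecting its own directory:
#   cd /home/fred/job1 || exit 1
```

The "|| exit 1" after cd stops the job immediately if the directory is missing, rather than letting it write into whatever directory it happened to start in.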

The cluster uses OpenMPI to control parallel jobs. To launch a parallel-aware program one generally uses mpirun. However, because SLURM is tightly coupled with OpenMPI, launching differs from running mpirun manually: one does not need to specify a hostfile/machinefile, nor the number of threads. Also be aware that, unlike Torque/PBS, there is no symmetrical geometry: if you request 2 nodes and 4 cores then SLURM will do the bare minimum to satisfy your request by running 3 threads on one node and 1 thread on the second. To retain symmetry use --mincpus=X or --ntasks-per-node=X, where X is the number of threads per node you wish to use. It is critical that the shell script specifies how many servers (nodes) and CPUs will be reserved. This prevents other users’ jobs from trying to run on the same cores, which would cause contention and slow down both jobs. Use the #SBATCH directives to specify the nodes and cores.

  #!/bin/sh
  #SBATCH --account=maths
  #SBATCH --partition=uctlong
  #SBATCH --time=1000:00
  #SBATCH --nodes=2 --ntasks=8 --ntasks-per-node=4
  #SBATCH --mem-per-cpu=2000
  #SBATCH --job-name="My Maths Job"
  #SBATCH --mail-user=MyEmail@uct.ac.za
  #SBATCH --mail-type=ALL

  mpirun /home/fred/mympiprog

This shell script tells SLURM that 2 nodes and exactly 4 CPUs on each node should be reserved. Note that if --ntasks-per-node were not specified then on 8-core nodes the first node would have used 7 cores and the second node 1 core. Unless told otherwise, the scheduler will not distribute the threads symmetrically. mpirun is coupled to the scheduler, so it is not necessary to specify a host file.
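A simple way to confirm how tasks were actually distributed is to count host names from inside the job. The sketch below assumes that srun starts one copy of the given command per task, which is its normal behaviour inside an allocation. The function name count_tasks_per_node and the node names in the stand-in input are made up.

```shell
#!/bin/sh
# Summarise a list of host names (one line per task) as "host count".
count_tasks_per_node() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job allocation, srun starts one copy of hostname per task:
#   srun hostname | count_tasks_per_node
# Stand-in input for a symmetric 2 x 4 layout (node names are made up):
printf '%s\n' node601 node601 node601 node601 \
              node602 node602 node602 node602 | count_tasks_per_node
```

With the stand-in input this prints 4 for each node. An asymmetric allocation would show up as unequal counts, for example 7 and 1.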

salloc:

The salloc command is used to obtain a SLURM job allocation interactively. When salloc successfully obtains the requested allocation, it runs the command specified by the user. When that command completes, salloc relinquishes the job allocation. Entering the following at the head node returns a confirmation and a prompt once resources are available:

  salloc --account maths --partition=ucthimem --time=1000:00 --nodes=1 --ntasks=1
  salloc: Granted job allocation 2060
  bob@srvcnthpc501:~$

User bob can now use srun to issue commands which will run on the assigned resources, even though the prompt still indicates the head node.

  bob@srvcnthpc501:~$ srun cat /etc/hostname
  srvcnthpc406.uct.ac.za

Typing exit relinquishes the resources and ends the job.

  bob@srvcnthpc501:~$ exit
  exit
  salloc: Relinquishing job allocation 2060
  salloc: Job allocation 2060 has been revoked.
  bob@srvcnthpc501:~$

srun:

It is possible to launch a cluster job directly from the command line (or a script).

  srun -A maths --partition=uctlong --time=1000:00 --nodes=1 --ntasks=1 /home/fred/myprog -o /home/fred/out.txt

The command line prompt is blocked until the job completes.

Interactive Jobs:

Use srun to launch bash, the command line interpreter:

  srun --pty --time=50:00:00 bash -l

Your prompt changes to that of a worker node and now any command you type is executed on that node. The commands do not need to be prefaced with the srun command. Type exit to end the job.

Checking job status
Type squeue to see a list of jobs in the queue, including those that are running:

andy@srvcnthpc501:~$ squeue

          JOBID PARTITION        NAME   USER  ST       TIME  NODES NODELIST(REASON)
          2143   uctlong   MyBatchJob   andy   S       0:00      1      (resources)
          2144  ucthimem   MPImemjobA   fred   R    2:25:02      2       hpc406,407
          2150  ucthimem   MPImemjobB   fred   R    1:15:27      2       hpc408,409

Here user andy wants to see why his job has stopped running. It is most likely that user fred is consuming all the available resources. Note that the partition names are different: in SLURM, servers can belong to multiple partitions, and in this case it is likely that andy’s job has been suspended because fred has access to a partition with higher priority.
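On a busy cluster the full listing is long, so it can help to filter it by state. The sketch below uses standard squeue options (-h to drop the header, -u to select a user, -o to choose columns, with %i for the job ID and %t for the compact state code). The helper name jobs_in_state is made up, and the stand-in input is taken from the listing above.

```shell
#!/bin/sh
# List job IDs in a given state from "jobid state" lines on standard input.
jobs_in_state() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# On the cluster, list your own suspended jobs with:
#   squeue -h -u "$USER" -o "%i %t" | jobs_in_state S
# Stand-in input taken from the listing above, filtered for running jobs:
printf '%s\n' '2143 S' '2144 R' '2150 R' | jobs_in_state R
```

With the stand-in input this prints the two running job IDs, 2144 and 2150.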

Cancelling jobs
To cancel a job, type scancel followed by the job ID, for example scancel 2143.
