The ICTS High Performance Cluster uses Torque PBS and Maui to schedule jobs. There is one head node (hpc.uct.ac.za) that users connect to in order to submit jobs. The /home partition on the head node is NFS mounted (i.e. common) to all worker nodes, regardless of series.
Basic jobs:
Create a shell script with parameters similar to the one below:
#PBS -q UCTlong
#PBS -l nodes=1:ppn=1:series200
#PBS -N MyRunTest-A15
/opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile.txt
You can then submit the job by typing: qsub myscript.sh
NB1: If your data files are in a subfolder you must either reference them explicitly, as in the above example, or use the cd command to change to the correct subfolder (see the sketch after NB3).
NB2: If you wish to run more than one job at a time in the same folder you must ensure that each job’s output is directed to a different file, otherwise data files will conflict or overwrite one another. In the above example the second job’s output should be directed to myfile2.txt.
NB3: If your software prefers to use all cores on a computer then make sure that you reserve those cores. For example, when running on the 200 series servers, which have 4 cores per server, change the directive line in your script as follows:
#PBS -l nodes=1:ppn=4:series200
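Putting these notes together, a minimal sketch of a second job's script might look as follows (the testA16 folder, job name and output file are made-up examples in the style of those above; adjust them to your own layout):

#PBS -q UCTlong
#PBS -l nodes=1:ppn=4:series200
#PBS -N MyRunTest-A16
cd /home/fred/testA16
/opt/exp_soft/softwareX/xyz -o /home/fred/testA16/myfile2.txt

This version changes into its own subfolder (NB1), writes to its own output file (NB2) and reserves all four cores of a 200 series node (NB3).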
While the job runs on the worker node, standard output and standard error (the screen output you’d see if you ran on a desktop) are written to two files, .o and .e. The worker nodes have local disk, normally on the order of 50GB. If the screen output of your software fills up the worker node's local disk your job will fail. It is best to ensure that your job output is directed to a file in /home or /scratch, either with a command line argument or with the Linux redirect operator (>).
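For example, if softwareX wrote its progress messages to the screen instead of accepting the -o argument, the last line of the basic script above could redirect them (the paths are the same hypothetical ones used earlier):

/opt/exp_soft/softwareX/xyz > /home/fred/testA15/screen.log 2>&1

The 2>&1 sends standard error to the same file, so neither stream accumulates on the worker node's local disk.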
Parallel jobs:
The advantage of an NFS file system is that parallel jobs write to one file system regardless of which worker node they start on. However, this does mean that each job a user submits must start in a unique folder if the software the job runs cannot be told to use unique data files.
As an example, user fred has a home directory /home/fred/ on the head node, and this directory is also mounted on each worker node. This means that if fred created /home/fred/myfile.txt on the head node, this file is also immediately present on each worker node. Fred now submits a job. The job initially lands on node400. OpenMPI then also starts parallel copies of this job on, for example, nodes 401 and 402. Each of the three nodes writes data to /home/fred/myfile.txt.
If Fred now submits another job and the software that Fred is using cannot distinguish between concurrently running versions, then data written to /home/fred/myfile.txt will be intermingled and/or corrupted. Hence it is critical that software which cannot handle concurrent runs be launched from unique directories. If fred wants to run 3 concurrent jobs then the following need to be created: /home/fred/job1, /home/fred/job2 and /home/fred/job3. The shell script that controls each job must contain a change directory command so that the correct directory is selected.
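For example (using the hypothetical names above), Fred would create the directories once on the head node:

mkdir /home/fred/job1 /home/fred/job2 /home/fred/job3

and the script controlling the first job would then include:

cd /home/fred/job1
/opt/exp_soft/softwareX/xyz -o /home/fred/job1/myfile.txt

with job2 and job3 pointing at their own directories in the same way.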
The cluster uses OpenMPI to control parallel jobs. To launch a parallel-aware program one generally uses mpirun. It is critical that the shell script specifies how many servers (nodes) and CPUs will be reserved. This prevents other users' jobs from trying to run on the same cores, which would cause contention and slow down both jobs. Use the #PBS directives to specify the nodes and cores. The #PBS directives can be used for other purposes too, such as giving the job a sensible name. Other #PBS directives can be found in the Torque documentation.
Once a shell script is created it can be submitted to a queue by using qsub. For example, the following script, myrun.sh:
#PBS -q UCTlong
#PBS -l nodes=2:ppn=4:series400
#PBS -N MyRunTest-A15
cd testA15
mpirun -hostfile $PBS_NODEFILE /opt/exp_soft/softwareX/xyz -o /home/fred/testA15/myfile.txt
would be submitted with the following command:
qsub myrun.sh
The shell script myrun.sh tells Torque that 2 servers should be reserved with 4 CPUs on each (8 CPUs in total) and that the servers should be part of the 400 series. It names the job, changes to the correct directory and then invokes mpirun. PBS provides its own hostfile (or machinefile); do not use your own file. The hostfile is referenced via the environment variable $PBS_NODEFILE. As PBS and MPI are integrated it is not necessary to use the -np parameter for mpirun. The final arguments are the executable file (xyz) and then any arguments to the executable, for example myfile.txt.
As another example here is a shell script that reserves 20 CPUs on the 200 series:
#PBS -q UCTlong
#PBS -l nodes=10:ppn=2:series200
#PBS -N MyLargeRun
cd LargeRun
mpirun -hostfile $PBS_NODEFILE /opt/exp_soft/softwareX/xyz -o /home/fred/LargeRun/myfile2.txt
It is important to include the series identifier so that all reserved nodes come from the same hardware series and all instances of the executable behave consistently.
Once the job is running a user can check the status using qstat, or by periodically looking at the HPC dashboard on the head node. To cancel a job use qdel and the job ID number. Once a job terminates all standard output will be written to a .o[jobID] file while any error messages are written to a .e[jobID] file. Other useful output should be wherever the user’s software normally directs it.
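For example (the job ID 1234 below is made up; use the ID that qsub reported when you submitted the job):

qstat                # list queued and running jobs
qstat -u fred        # show only jobs belonging to user fred
qdel 1234            # cancel job 1234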
It is important not to exceed the CPU and wall times of the queue, otherwise the job will be terminated. Users can check the maximum CPU and wall times on the dashboard. If you suspect that these may be exceeded it would be wise to discuss this with the cluster administrators.
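The queue limits can also be queried directly on the head node with the standard Torque commands, for example:

qstat -q             # summary of all queues and their limits
qstat -Qf UCTlong    # full attributes of the UCTlong queue, including its maximum CPU and wall times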
Side note: before you can make use of OpenMPI you will need to run openmpi-selector-menu from your shell. This gives you the option of selecting the correct version of MPI to run and saves your choice in the .mpi-selector file in your home directory.