NB. If you intend to submit jobs to the a100 partition you need to read this entire document extremely carefully, or your jobs will not run.
The a100 partition is made up of servers srvcntgpu009, srvcntgpu010 and srvcntgpu011. Each server has 56 CPU cores and 4 A100 Nvidia GPU cards.
Currently the HPC a100 partition servers are split as follows:
GPU009 (40GB cards; restricted to the Compsci Vaccine group only):
GPU0: amperemk GPU1: amperemk GPU2: amperemk GPU3: amperemk
GPU010 (40GB cards):
GPU0: ampere GPU1: ampere GPU2: ampere GPU3: ampere
GPU011 (80GB cards):
GPU0: ampere80 GPU1: ampere80 GPU2: ampere80 GPU3: ampere80
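These labels are the values you will pass to the scheduler via the --gres option, as described in the next section. For example, to request one of the 80GB cards on GPU011 you would use the ampere80 label (a minimal sketch of the relevant directive only):
#SBATCH --gres=gpu:ampere80:1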
How to submit jobs
You need to let the scheduler know that you require a GPU resource. This is done via the generic resource (gres) tag. The format is type:label:quantity.
Via batch queue:
#SBATCH --partition=a100
#SBATCH --gres=gpu:ampere:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2
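Putting the directives together, a minimal batch script might look like the following (a sketch only; the job name, time limit, script name and program are placeholders to be replaced with your own):
#!/bin/bash
#SBATCH --partition=a100
#SBATCH --gres=gpu:ampere:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2
#SBATCH --job-name=myGpuJob
#SBATCH --time=01:00:00
# Load your software environment here, then run your GPU program
./my_gpu_program
Submit the script with sbatch, for example: sbatch myjob.sh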
Via interactive:
sintx --partition=a100 --account=mygpugroup --ntasks=2 --gres=gpu:ampere:1
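Once the interactive session starts you can confirm that a GPU was allocated to you, for example by running nvidia-smi on the compute node; Slurm also normally records the card(s) assigned to your job in CUDA_VISIBLE_DEVICES:
nvidia-smi
echo $CUDA_VISIBLE_DEVICES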
In the above examples the text mygpugroup must be replaced by the GPU group you were granted access to. This is not your user account.
Please be extremely careful with the interactive command as an incorrect request can cause problems with the scheduler.
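If you are unsure which GPU group (Slurm account) you have been granted, you can list your associations from the head node using standard Slurm tooling (the exact output depends on the local configuration):
sacctmgr show associations user=$USER format=account,partition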
Graphs and monitoring:
Feedback is important for determining how well your job is running. On the head node you may use the a100cores command to see where your job is running and which GPU instance it is utilizing.
The HPC dashboard displays the GPU utilization percentage, and graphs of this are available; both are updated once every minute.
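You can also query the scheduler directly to see which node and GPU resources were assigned to your job (standard Slurm commands; replace 123456 with your own job ID):
squeue -u $USER
scontrol show job 123456 | grep -iE "nodelist|tres"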
Who may apply for access to this partition?
Groups from the Science and Engineering faculties contributed heavily to the cost of this resource. Members of these groups are granted access to resources in proportion to their contribution, with the levels set by a committee of system owners.
ICTS contributed to the cost of several of the cards and has ‘donated’ these instances to the general pool of researchers free of charge. However, this pool is limited in the type and number of instances available, the wall time, and the number of jobs that can be queued at any one time.
Will more servers like these be purchased?
The cost of one of these servers is approximately R1.4 million, which is understandably more than the average research group can afford. We strongly encourage research groups to pool their resources in order to share the cost of these servers/cards. The money would then be transferred to ICTS, who would purchase the server. The servers are housed in the UCT data center and are administered by the HPC staff, who are also responsible for any repairs or replacements that may be required. HPC staff members can also facilitate the discussions around purchase and resource sharing.