If you intend to submit jobs to the GPU partitions, you need to read this entire document carefully or your jobs will not run.
How to submit jobs
There are several GPU partitions. Each is made up of several worker nodes, and each worker node has four GPU cards. You need to let the scheduler know that you require a GPU resource. This is done via the generic resource (gres) directive, which has the format type:label:quantity. The type is gpu, the label is the model of GPU card you have access to, and the quantity is the number of cards you require. For example, to reserve 3 L40S cards use:
--gres=gpu:l40s:3
In addition, you will need to set the account directive, which is not the same as your default account. This account name will be sent to you when you are granted access to the GPU resources, and it is also embedded in the gpu-example-script.sh file in your home folder.
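If you are unsure which account names your user has been associated with, the generic Slurm accounting query below will list them (the format fields shown are optional and only a suggestion):

sacctmgr show associations user=$USER format=Account,Partition,QOS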
Example
Via batch queue:
#SBATCH --partition=a100
#SBATCH --gres=gpu:ampere:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2
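For context, a complete batch script might look like the minimal sketch below. The job name, time limit, module and program names are placeholders rather than values specific to this cluster; replace them with your own. Submit the script with sbatch.

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=a100
#SBATCH --gres=gpu:ampere:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2
#SBATCH --time=01:00:00

# Load whatever software environment your program needs (placeholder module name)
# module load cuda

# Run your GPU program (placeholder executable name)
./my_gpu_program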
Via interactive:
sintx --partition=a100 --account=mygpugroup --ntasks=2 --gres=gpu:ampere:1
In the above examples, the text mygpugroup must be replaced by the GPU group you were granted access to; this is not your user account. Please do not submit jobs to the GPU partitions without a gres directive.
Server | Partition | Cards | Card Type | gres label | Precision | Notes |
---|---|---|---|---|---|---|
srvrocgpu005 | gpumk | 4 | P100 | gpumk | Double | Compsci vaccine group only |
srvrocgpu006 | gpumk | 4 | P100 | gpumk | Double | Compsci vaccine group only |
srvrocgpu007 | gpumk | 4 | P100 | gpumk | Double | Compsci vaccine group only |
srvrocgpu008 | gpumk | 4 | P100 | gpumk | Double | Compsci vaccine group only |
srvrocgpu009 | a100 | 4 | A100-40 | amperemk | Double | Compsci vaccine group only |
srvrocgpu010 | a100 | 4 | A100-40 | ampere | Double | a100free / aru groups |
srvrocgpu011 | a100 | 4 | A100-80 | ampere80 | Double | Compsci NLP group only |
srvrocgpu012 | l40s | 4 | L40S | l40s | Single | l40sfree group |
srvrocgpu013 | l40s | 4 | L40S | l40s | Single | l40sfree group |
srvrocgpu014 | l40s | 4 | L40S | l40s | Single | l40sfree group |
srvrocgpu015 | l40s | 4 | L40S | l40s | Single | l40sfree group |
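The gres labels listed above can also be confirmed directly from the scheduler. The generic Slurm command below lists each node in a partition together with its generic resources; the a100 partition is used purely as an example:

sinfo --partition=a100 --format="%N %G"

The label you pass to --gres must match one of the labels reported here, otherwise the scheduler will not be able to place your job.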
Graphs and monitoring:
Feedback is important for determining how well your job is running. You can use qstat to list the queue, but on the head node you may also use the gpuqueue command to determine where your GPU job is running and which instance it is utilizing.
The HPC dashboard displays the GPU utilization percentage, and graphs are also available; both are updated once every minute.
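In addition to the site-specific gpuqueue command, the generic Slurm commands below can be used to check where a job is running and to spot-check GPU utilization; the job ID is a placeholder, and the nvidia-smi line is intended to be run from inside a job script on the allocated node:

squeue -u $USER
scontrol show job <jobid>

# From inside a running job script, nvidia-smi can log GPU utilization and memory use:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv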
Who may apply for access to these servers?
Groups from the Science and Engineering faculties contributed heavily to the cost of this resource. Members of these groups are granted access to resources in proportion to their contribution, with the levels set by a committee of system owners. If you are a member of one of these groups, you may apply for access by emailing the administrators and copying your supervisor.
ICTS contributed to the cost of several of the A100 and L40S cards and has ‘donated’ these instances to the general pool of researchers free of charge. However, this pool is limited in the type and number of instances available, the wall time, and the number of jobs that can be queued at any one time. To access either the a100free or l40sfree partitions, please email the administrators.
Will more servers like these be purchased?
The cost of these servers is understandably higher than the average research group can afford. We would strongly encourage research groups to pool their resources in order to share the cost of these servers/cards. The money would then be transferred to ICTS, who would purchase the server. The servers are housed in the UCT data center and are administered by the HPC staff, who are also responsible for any repairs or replacements that may be required. HPC staff members can also facilitate the discussions around purchasing and resource sharing.