GPU partition

If you intend to submit jobs to the GPU partitions, read this entire document carefully; otherwise your jobs will not run.

How to submit jobs

There are several GPU partitions. Each is made up of several worker nodes, and each worker node has four GPU cards. You need to let the scheduler know that you require a GPU resource. This is done via the generic resource (gres) tag, whose format is type:label:quantity. The type is gpu, the label identifies the model of GPU card you have access to (see the gres column in the table further down), and the quantity is how many cards you require. For example, to reserve one L40S card use:

--gres=gpu:l40s:1
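As a minimal sketch of the type:label:quantity format described above, the following bash helper assembles a gres option from a label and a quantity. The function name gres_flag is hypothetical and is not part of the cluster's tooling; it only illustrates how the three fields fit together.

```shell
#!/bin/bash
# Hypothetical helper (not a cluster-provided command): build a gres option
# in the type:label:quantity format, where the type is always "gpu".
gres_flag() {
    local label="$1"          # GPU model label, e.g. l40s
    local quantity="${2:-1}"  # number of cards; defaults to 1
    printf -- '--gres=gpu:%s:%s\n' "$label" "$quantity"
}

# Reserve one L40S card
gres_flag l40s 1
```

The quantity defaults to 1, in line with the advice to reserve a single card unless your code is known to use several.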

In addition, you will need to set the account directive to your GPU account, which is not the same as your default account. This account name will be sent to you when you are granted access to the GPU resources, and it will also be embedded in the gpu-example-script.sh file in your home folder.

Please do not reserve more than one GPU card per job unless you are certain that your code is capable of running on more than one GPU card simultaneously.

Example

Via batch queue:

#SBATCH --partition=a100
#SBATCH --gres=gpu:ampere:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2

Via interactive:

sintx --partition=a100 --account=mygpugroup --ntasks=2 --gres=gpu:ampere:1

In the above examples, the text mygpugroup must be replaced by the GPU group you were granted access to. This is not your user account. Please do not submit jobs to the GPU queue without a gres directive.
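For context, the batch directives above could sit at the top of a complete job script along the lines of the sketch below. The directives are those from the example; the body is an assumption for illustration only. It relies on Slurm exporting CUDA_VISIBLE_DEVICES for gres-allocated GPUs, which depends on how the cluster is configured, and on nvidia-smi being installed on the worker node. The function name report_gpus is hypothetical.

```shell
#!/bin/bash
# Replace mygpugroup with the GPU group you were granted access to.
#SBATCH --partition=a100
#SBATCH --gres=gpu:ampere:1
#SBATCH --account=mygpugroup
#SBATCH --ntasks=2

# Hypothetical helper: report which GPU(s) the job can see.
report_gpus() {
    # Assumption: Slurm sets CUDA_VISIBLE_DEVICES for the allocated card(s).
    echo "Allocated GPU(s): ${CUDA_VISIBLE_DEVICES:-none visible}"
    # Only query the driver if nvidia-smi is present on this node.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv \
            || echo "nvidia-smi could not query a GPU"
    fi
}

report_gpus
# Your own GPU program would be launched here.
```

Checking the visible devices at the start of a job is a cheap way to confirm that the gres request was honoured before any real work begins.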

Server	Partition	Cards	Card Type	gres	Precision	Notes
srvrocgpu005	gpumk	4	P100	gpumk	Double	Compsci vaccine group only
srvrocgpu006	gpumk	4	P100	gpumk	Double	Compsci vaccine group only
srvrocgpu007	gpumk	4	P100	gpumk	Double	Compsci vaccine group only
srvrocgpu008	gpumk	4	P100	gpumk	Double	Compsci vaccine group only
srvrocgpu009	a100	4	A100-40	amperemk	Double	Compsci vaccine group only
srvrocgpu010	a100	4	A100-40	ampere	Double	a100free / aru groups
srvrocgpu011	a100	4	A100-80	ampere80	Double	Compsci NLP group only
srvrocgpu012	l40s	4	L40S	l40s	Single	l40sfree group
srvrocgpu013	l40s	4	L40S	l40s	Single	l40sfree group
srvrocgpu014	l40s	4	L40S	l40s	Single	l40sfree group
srvrocgpu015	l40s	4	L40S	l40s	Single	l40sfree group
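The partition and gres columns of the table must be used together. As a sketch, the mapping can be written as a small bash lookup that returns the partition for a given gres label. The function name partition_for_gres is hypothetical; the values are taken from the table above and would need updating if the table changes.

```shell
#!/bin/bash
# Hypothetical lookup (not a cluster-provided command): map a gres label
# from the table to the partition that hosts it.
partition_for_gres() {
    case "$1" in
        gpumk)                    echo "gpumk" ;;
        amperemk|ampere|ampere80) echo "a100" ;;
        l40s)                     echo "l40s" ;;
        *) echo "unknown gres label: $1" >&2; return 1 ;;
    esac
}

# Build matching partition and gres options for one A100-80 card.
label="ampere80"
echo "--partition=$(partition_for_gres "$label") --gres=gpu:${label}:1"
```

The account directive still has to be supplied separately, since it depends on the group you were granted access to.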

Graphs and monitoring:

Feedback is important to determine how well your job is running. You can use qstat to list the queue, but on the head node you may also use the gpuqueue command to determine where your GPU job is running and which instance it is utilizing.

The HPC dashboard displays the GPU utilization percentage. The dashboard is updated once every minute.

There are graphs available which update once every minute.

Who may apply for access to these servers?

Groups from the Science and Engineering faculties contributed heavily to the cost of this resource. Members of these groups are granted access to resources in proportion to their contribution; these levels are set by a committee of system owners. If you are a member of one of these groups you may apply for access by emailing the administrators and copying your supervisor.

ICTS contributed to the cost of several of the A100 and L40S cards and has ‘donated’ these instances to the general pool of researchers free of charge. However, this pool is limited in the type and number of instances available, the wall time, and the number of jobs that can be queued at any one time. To access either the a100free or l40sfree partitions please email the administrators.

Will more servers like these be purchased?

The cost of these servers is understandably higher than the average research group can afford. We strongly encourage research groups to pool their resources in order to share the cost of these servers/cards. The money would then be transferred to ICTS, which would purchase the server. The servers are housed in the UCT data center and are administered by the HPC staff, who are also responsible for any repairs or replacements that may be required. HPC staff members can also facilitate the discussions around purchase and resource sharing.

UCTHPC