{"version":"1.0","provider_name":"UCT HPC","provider_url":"https:\/\/ucthpc.uct.ac.za","author_name":"Andrew Lewis","author_url":"https:\/\/ucthpc.uct.ac.za\/index.php\/author\/andrew-lewis\/","title":"GPUs and GRES - UCT HPC","type":"rich","width":600,"height":338,"html":"<blockquote class=\"wp-embedded-content\" data-secret=\"CDoZuh4du3\"><a href=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2015\/06\/22\/gpus-and-gres\/\">GPUs and GRES<\/a><\/blockquote><iframe sandbox=\"allow-scripts\" security=\"restricted\" src=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2015\/06\/22\/gpus-and-gres\/embed\/#?secret=CDoZuh4du3\" width=\"600\" height=\"338\" title=\"&#8220;GPUs and GRES&#8221; &#8212; UCT HPC\" data-secret=\"CDoZuh4du3\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" class=\"wp-embedded-content\"><\/iframe><script type=\"text\/javascript\">\n\/*! This file is auto-generated *\/\n!function(c,d){\"use strict\";var e=!1,o=!1;if(d.querySelector)if(c.addEventListener)e=!0;if(c.wp=c.wp||{},c.wp.receiveEmbedMessage);else if(c.wp.receiveEmbedMessage=function(e){var t=e.data;if(!t);else if(!(t.secret||t.message||t.value));else if(\/[^a-zA-Z0-9]\/.test(t.secret));else{for(var r,s,a,i=d.querySelectorAll('iframe[data-secret=\"'+t.secret+'\"]'),n=d.querySelectorAll('blockquote[data-secret=\"'+t.secret+'\"]'),o=new RegExp(\"^https?:$\",\"i\"),l=0;l<n.length;l++)n[l].style.display=\"none\";for(l=0;l<i.length;l++)if(r=i[l],e.source!==r.contentWindow);else{if(r.removeAttribute(\"style\"),\"height\"===t.message){if(1e3<(s=parseInt(t.value,10)))s=1e3;else if(~~s<200)s=200;r.height=s}if(\"link\"===t.message)if(s=d.createElement(\"a\"),a=d.createElement(\"a\"),s.href=r.getAttribute(\"src\"),a.href=t.value,!o.test(a.protocol));else if(a.host===s.host)if(d.activeElement===r)c.top.location.href=t.value}}},e)c.addEventListener(\"message\",c.wp.receiveEmbedMessage,!1),d.addEventListener(\"DOMContentLoaded\",t,!1),c.addEventListener(\"load\",t,!1);function 
t(){if(o);else{o=!0;for(var e,t,r,s=-1!==navigator.appVersion.indexOf(\"MSIE 10\"),a=!!navigator.userAgent.match(\/Trident.*rv:11\\.\/),i=d.querySelectorAll(\"iframe.wp-embedded-content\"),n=0;n<i.length;n++){if(!(r=(t=i[n]).getAttribute(\"data-secret\")))r=Math.random().toString(36).substr(2,10),t.src+=\"#?secret=\"+r,t.setAttribute(\"data-secret\",r);if(s||a)(e=t.cloneNode(!0)).removeAttribute(\"security\"),t.parentNode.replaceChild(e,t);t.contentWindow.postMessage({message:\"ready\",secret:r},\"*\")}}}}(window,document);\n<\/script>\n","description":"Our current cluster, hex, runs Torque with MAUI as the scheduler. While MAUI is GPU aware it does not allow GPUs&nbsp;to be scheduled. In other words you can list the nodes with GPUs but you cannot submit a job based on these resources, nor can you lock a&nbsp;GPU. The MOAB scheduler for Torque can do this, but the license costs several hundred thousand dollars. Fortunately SLURM has this functionality built into it, and what's more is it's free. GPU cards are defined as generic resource (GRES) objects and are listed by type and number. Each GPU card is assigned to a certain number of cores in a server. One needs to enter the following line:GresTypes=gpuin the slurm.conf file and also add Gres information to the node configurations, for example:NodeName=hpc406 ... Gres=gpu:kepler:2One must also create a gres.conf file on the nodes that actually house the GPU cards:Name=gpu Type=kepler File=\/dev\/nvidia0 CPUs=0,1,2,3Name=gpu Type=kepler File=\/dev\/nvidia1 CPUs=4,5,6,7This indicates which cores are assigned to which card.&nbsp;To request a GPU resource one enters the following requirement in sbatch, salloc or srun:#SBATCH --gres=gpu:2 --nodes=1 --ntasks=1When the job runs an environment variable is set:CUDA_VISIBLE_DEVICES=0,1depending on how many GPU cards have been requested. Below are three jobs all requesting 2 cores and a single GPU card. 
Only two are running, even though there are cores free:

```
JOBID PARTITION     NAME  USER ST  TIME  NODELIST
 1937  ucthimem GresTest  andy PD  0:00  (Resources)
 1935  ucthimem GresTest  andy  R  1:08  hpc406
 1936  ucthimem GresTest  andy  R  1:08  hpc406
```

Examining job 1935 shows us that its cores are set to CPU_IDs=1-2, while job 1936's cores are set to CPU_IDs=4-5. Additionally, CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1 are set for jobs 1935 and 1936 respectively.
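Putting the request options together, a complete job script for one of the jobs above might look like the following sketch. The job name matches the example queue, but the output handling and the echo at the end are illustrative additions, not taken from the post:

```shell
#!/bin/bash
#SBATCH --job-name=GresTest      # job name, as in the example queue above
#SBATCH --gres=gpu:1             # one GPU; gpu:kepler:1 would also pin the type
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2        # two cores, as the example jobs request

# SLURM exports CUDA_VISIBLE_DEVICES for the allocated card(s), so a CUDA
# application launched from here sees only the GPU(s) assigned to this job.
echo "Job ${SLURM_JOB_ID:-<not under SLURM>} sees GPU(s): ${CUDA_VISIBLE_DEVICES:-none}"
```

Submitted with `sbatch script.sh`; since `#SBATCH` lines are ordinary comments, the script can also be run directly with bash as a quick syntax check.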
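The CPU_IDs values quoted above come from scontrol's detailed job view. Assuming access to the cluster, something along these lines would reproduce them (the job ID is taken from the example queue):

```shell
# -d adds per-node detail lines to the job report, including the CPU_IDs
# field that shows which cores the job was bound to, alongside its Gres.
scontrol show job -d 1935 | grep -E 'CPU_IDs|Gres'
```
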