<?xml version="1.0"?>
<oembed><version>1.0</version><provider_name>UCT HPC</provider_name><provider_url>https://ucthpc.uct.ac.za</provider_url><author_name>Andrew Lewis</author_name><author_url>https://ucthpc.uct.ac.za/index.php/author/andrew-lewis/</author_url><title>SLURM job preemption - UCT HPC</title><type>rich</type><width>600</width><height>338</height><html>&lt;blockquote class="wp-embedded-content" data-secret="441OZGWDTy"&gt;&lt;a href="https://ucthpc.uct.ac.za/index.php/2015/06/19/slurm-job-preemption/"&gt;SLURM job preemption&lt;/a&gt;&lt;/blockquote&gt;
</html><description>SLURM provides a preemption mechanism to deal with situations where the cluster becomes overloaded. It can be configured in several ways:

FIFO:
This is the simplest queueing method: there is no preemption; jobs arrive, queue, and are dealt with in that order. Backfill scheduling is also enabled by default, which allows jobs to be scheduled ahead of their queue position as long as they won't delay the start of jobs ahead of them in the queue.

Preemption via Thread Control:
Here priority is based on the partition that the user submits jobs to. Core sharing is disabled and PreemptMode is set to SUSPEND. Below is a case where all cores are used:

JOBID PARTITION     NAME     USER ST    TIME  NODELIST
 1807     uctlo MemTestB      bob  R    5:28  hpc402
 1808     uctlo MemTestB      bob  R    5:25  hpc402
 1809     uctlo MemTestB      bob  R    5:25  hpc402
 1811     ucthi MemTestA     andy  R    4:53  hpc401
 1812     ucthi MemTestA     andy  R    3:07  hpc401
 1813     ucthi MemTestA     andy  R    1:30  hpc401

User andy submits a job to the ucthi partition:

JOBID PARTITION     NAME     USER ST    TIME  NODELIST
 1808     uctlo MemTestB      bob  R    5:34  hpc402
 1809     uctlo MemTestB      bob  R    5:34  hpc402
 1811     ucthi MemTestA     andy  R    5:02  hpc401
 1812     ucthi MemTestA     andy  R    3:16  hpc401
 1813     ucthi MemTestA     andy  R    1:39  hpc401
 1814     ucthi MemTestA     andy  R    0:01  hpc402
 1807     uctlo MemTestB      bob  S    5:36  hpc402

One of bob's jobs is now suspended while andy's job runs. Note, however, that bob's suspended job 1807 still holds its RAM, which can be a problem if andy's job also needs a lot of memory. Where RAM is the constraint it is best to cancel and resubmit low-priority jobs.

Here core sharing is disabled and PreemptMode is set to REQUEUE. User bob is running jobs on all available cores of the uctlo partition:

JOBID PARTITION     NAME     USER ST    TIME  NODELIST
 1794     uctlo MemTestB      bob  R    0:33  hpc401
 1795     uctlo MemTestB      bob  R    0:30  hpc401
 1796     uctlo MemTestB      bob  R    0:30  hpc401
 1797     uctlo MemTestB      bob  R    0:30  hpc402
 1798     uctlo MemTestB      bob  R    0:30  hpc402
 1799     uctlo MemTestB      bob  R    0:27  hpc402

User andy starts submitting jobs to the ucthi partition and bob's jobs are cancelled:

JOBID PARTITION     NAME     USER ST    TIME  NODELIST
 1794     uctlo MemTestB      bob CG    0:00  hpc401
 1795     uctlo MemTestB      bob  R    0:49  hpc401
 1796     uctlo MemTestB      bob  R    0:49  hpc401
 1797     uctlo MemTestB      bob  R    0:49  hpc402
 1798     uctlo MemTestB      bob  R    0:49  hpc402
 1799     uctlo MemTestB      bob  R    0:46  hpc402
 1801     ucthi MemTestA     andy  R    0:01  hpc401

Three of bob's jobs have been cancelled and resubmitted in a pending state while andy's three jobs are now running:

 1794     uctlo MemTestB      bob PD    0:00  (BeginTime)
 1795     uctlo MemTestB      bob PD    0:00  (BeginTime)
 1796     uctlo MemTestB      bob PD    0:00  (BeginTime)
 1797     uctlo MemTestB      bob  R    0:59  hpc402
 1798     uctlo MemTestB      bob  R    0:59  hpc402
 1799     uctlo MemTestB      bob  R    0:56  hpc402
 1801     ucthi MemTestA     andy  R    0:11  hpc401
 1802     ucthi MemTestA     andy  R    0:06  hpc401
 1803     ucthi MemTestA     andy  R    0:02  hpc401

In this mode it is possible to oversubscribe the cores on a node, so memory limits need to be protected by making memory a consumable resource. Even with memory set as a consumable resource, jobs may still not run if there is insufficient RAM per core.
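A setup along these lines might be sketched in slurm.conf roughly as follows. This is an illustrative fragment only, not the cluster's actual configuration; the partition names, node ranges, and values are examples:

```
# Sketch: partition-priority preemption with requeue and consumable memory.
# All names and numbers below are illustrative examples.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory    # track both cores and memory as consumables
PreemptType=preempt/partition_prio     # job priority comes from the partition
PreemptMode=REQUEUE                    # preempted jobs are cancelled and requeued
DefMemPerCPU=1024                      # default MB of RAM per core per job
PartitionName=uctlo Nodes=hpc[401-402] Priority=1  State=UP
PartitionName=ucthi Nodes=hpc[401-402] Priority=10 State=UP
```

With CR_Core_Memory, a job's memory request is reserved alongside its cores, so a higher-priority job cannot land on a node whose remaining RAM is already committed.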
A default RAM/core/job is set in the configuration file, but users can set their own requirements within the bounds of their account settings.

Preemption via Gang scheduling:
Here users andy and bob have filled up all available cores and some of andy's jobs are suspended.

JOBID PARTITION     NAME     USER ST    TIME  NODELIST
 1815  ucthimem MemTestB      bob  R    0:50  hpc406
 1816  ucthimem MemTestB      bob  R    0:47  hpc406
 1817  ucthimem MemTestB      bob  R    0:47  hpc407
 1820  ucthimem MemTestA   alewis  R    0:19  hpc407
 1821  ucthimem MemTestA   alewis  R    0:16  hpc408
 1822  ucthimem MemTestA   alewis  R    0:16  hpc408
 1823  ucthimem MemTestA   alewis  S    0:00  hpc406
 1824  ucthimem MemTestA   alewis  S    0:00  hpc406

After a time slice has passed some of bob's jobs are suspended and andy's jobs run.

JOBID PARTITION     NAME     USER ST    TIME  NODELIST
 1817  ucthimem MemTestB      bob  R    0:47  hpc407
 1820  ucthimem MemTestA     andy  R    0:19  hpc407
 1821  ucthimem MemTestA     andy  R    0:16  hpc408
 1822  ucthimem MemTestA     andy  R    0:16  hpc408
 1823  ucthimem MemTestA     andy  R    0:00  hpc406
 1824  ucthimem MemTestA     andy  R    0:00  hpc406
 1815  ucthimem MemTestB      bob  S    0:50  hpc406
 1816  ucthimem MemTestB      bob  S    0:47  hpc406

This repeats until all jobs are completed. The time slice between job suspensions is 60 seconds by default. If user andy submits to a higher-priority partition, this scheduling scheme reverts to standard job preemption and bob's jobs are suspended indefinitely until cores become free. Once again cores can be oversubscribed, so memory needs to be protected.

Multifactor preemption:
This is the most complex form of preemption. Here the job priority is based upon a multi-factor algorithm:

Job_priority =
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor)

You are encouraged to read the official documentation on how this can be configured. Below is a simple example where priority is based only on user QOS. User bob is consuming all available cores:

JOBID PARTITION        NAME  USER  ST    TIME  NODELIST
 1893     ucthi   CoreTestB   bob   R    0:09  hpc402
 1892     ucthi   CoreTestB   bob   R    0:12  hpc402
 1889     ucthi   CoreTestB   bob   R    0:15  hpc401
 1890     ucthi   CoreTestB   bob   R    0:15  hpc401
 1891     ucthi   CoreTestB   bob   R    0:15  hpc402
 1888     ucthi   CoreTestB   bob   R    0:18  hpc401

User andy submits a job which won't run as there are no resources free.
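The decision that follows hinges on the weighted sum above. As a rough sketch (all weight and factor values below are invented for illustration and are not the cluster's settings), a young job with a high QOS factor outranks an older job with a low one:

```python
# Illustrative evaluation of the multifactor priority formula.
# Weights and normalized factors (each in [0, 1]) are made-up example values.

def job_priority(weights, factors):
    """Weighted sum of the normalized priority factors."""
    return sum(weights[k] * factors[k] for k in weights)

weights = {"age": 1000, "fairshare": 5000, "job_size": 100,
           "partition": 2000, "qos": 10000}

# A young job from a high-QOS user...
high_qos = job_priority(weights, {"age": 0.1, "fairshare": 0.5, "job_size": 0.2,
                                  "partition": 1.0, "qos": 1.0})   # 14620.0
# ...beats an older job from a low-QOS user.
low_qos = job_priority(weights, {"age": 0.9, "fairshare": 0.5, "job_size": 0.2,
                                 "partition": 1.0, "qos": 0.1})    # 6420.0
```

Because PriorityWeightQOS dominates the other weights in this sketch, the QOS factor decides the ordering even though the low-QOS job has aged longer.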
As andy's job has a higher priority, enough of bob's jobs are cancelled to allow it to run:

JOBID PARTITION        NAME  USER  ST    TIME  NODELIST
 1889     ucthi   CoreTestB   bob  CG    0:00  hpc401
 1890     ucthi   CoreTestB   bob  CG    0:00  hpc401
 1888     ucthi   CoreTestB   bob  CG    0:00  hpc401
 1894     ucthi   CoreTestA  andy  PD    0:00  (Resources)
 1893     ucthi   CoreTestB   bob   R    0:13  hpc402
 1892     ucthi   CoreTestB   bob   R    0:16  hpc402
 1891     ucthi   CoreTestB   bob   R    0:19  hpc402

Once andy's job is running bob's jobs are requeued:

JOBID PARTITION        NAME  USER  ST    TIME  NODELIST
 1888     ucthi   CoreTestB   bob  PD    0:00  (BeginTime)
 1889     ucthi   CoreTestB   bob  PD    0:00  (BeginTime)
 1890     ucthi   CoreTestB   bob  PD    0:00  (BeginTime)
 1894     ucthi   CoreTestA  andy   R    0:03  hpc401
 1893     ucthi   CoreTestB   bob   R    0:17  hpc402
 1892     ucthi   CoreTestB   bob   R    0:20  hpc402
 1891     ucthi   CoreTestB   bob   R    0:23  hpc402

After a short time the backfill scheduler allows one of bob's restarted jobs to run.
This is because andy's job needs two nodes while bob's jobs each need only one.

JOBID PARTITION        NAME  USER  ST    TIME  NODELIST
 1890     ucthi   CoreTestB   bob  PD    0:00  (Priority)
 1889     ucthi   CoreTestB   bob  PD    0:00  (Resources)
 1894     ucthi   CoreTestA  andy   R    2:41  hpc401
 1888     ucthi   CoreTestB   bob   R    2:16  hpc401
 1893     ucthi   CoreTestB   bob   R    2:55  hpc402
 1892     ucthi   CoreTestB   bob   R    2:58  hpc402
 1891     ucthi   CoreTestB   bob   R    3:01  hpc402

Unfortunately this preemption mode is not compatible with thread suspension.</description></oembed>
