SLURM provides a preemption mechanism to deal with situations where a cluster becomes overloaded. This can be configured in several ways:
FIFO:
This is the simplest method of queueing: there is no preemption, and jobs are queued and run in the order in which they arrive. Backfill scheduling is also enabled by default; it allows jobs further back in the queue to start early as long as they will not delay the start of jobs ahead of them.
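The backfill rule described above can be illustrated with a small Python sketch. This is a toy model and not SLURM's actual scheduler: it assumes a single pool of interchangeable cores and that every job runs for exactly its requested time. The names Job, earliest_start and can_backfill are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    minutes: int  # requested run time (remaining time for running jobs)


def earliest_start(head, free_cores, running):
    """Return (minutes until the head job can start, cores spare at that time)."""
    available = free_cores
    if available >= head.cores:
        return 0, available - head.cores
    # Release cores in the order in which running jobs are expected to finish.
    for job in sorted(running, key=lambda j: j.minutes):
        available += job.cores
        if available >= head.cores:
            return job.minutes, available - head.cores
    raise ValueError("head job can never fit on this cluster")


def can_backfill(candidate, head, free_cores, running):
    """True if the candidate can start now without delaying the head job."""
    if candidate.cores > free_cores:
        return False
    start, spare = earliest_start(head, free_cores, running)
    # Either the candidate finishes before the head job starts,
    # or it only uses cores the head job will not need.
    return candidate.minutes <= start or candidate.cores <= spare


# Hypothetical cluster: 6 cores, 4 of them busy for another 30 minutes.
running = [Job("r1", cores=4, minutes=30)]
head = Job("big", cores=6, minutes=60)  # first in the queue, must wait
print(can_backfill(Job("short", 2, 20), head, 2, running))  # True
print(can_backfill(Job("long", 2, 60), head, 2, running))   # False
```

The short job is allowed to jump ahead because it is finished before the head job could have started anyway; the long job would hold cores the head job needs, so it stays in the queue.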
Preemption via Thread Control:
Here priority is based on the partition that the user submits jobs to. Core sharing is disabled and PreemptMode is set to SUSPEND. Below is a case where all cores are in use…
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1807   uctlo      MemTestB  bob   R   5:28  hpc402
1808   uctlo      MemTestB  bob   R   5:25  hpc402
1809   uctlo      MemTestB  bob   R   5:25  hpc402
1811   ucthi      MemTestA  andy  R   4:53  hpc401
1812   ucthi      MemTestA  andy  R   3:07  hpc401
1813   ucthi      MemTestA  andy  R   1:30  hpc401
User andy submits a job to the ucthi partition…
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1808   uctlo      MemTestB  bob   R   5:34  hpc402
1809   uctlo      MemTestB  bob   R   5:34  hpc402
1811   ucthi      MemTestA  andy  R   5:02  hpc401
1812   ucthi      MemTestA  andy  R   3:16  hpc401
1813   ucthi      MemTestA  andy  R   1:39  hpc401
1814   ucthi      MemTestA  andy  R   0:01  hpc402
1807   uctlo      MemTestB  bob   S   5:36  hpc402
One of bob’s jobs is now suspended while andy’s job runs. However, bob’s suspended job 1807 is still consuming RAM, which may be a problem if andy’s job also needs a lot of RAM. Where RAM is an issue it is better to cancel and resubmit low priority jobs.
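The RAM problem with suspension can be made concrete with a short Python sketch. It is a simplified model under stated assumptions, not SLURM's implementation: each job uses one core, the partition tiers in TIER and the 2000 MB per job are invented figures, and the victim is simply the first eligible job in the list. All names are hypothetical.

```python
from dataclasses import dataclass

# Assumed partition tiers: a higher number may preempt a lower one.
TIER = {"uctlo": 1, "ucthi": 2}


@dataclass
class Job:
    job_id: int
    partition: str
    mem_mb: int
    state: str = "R"  # "R" running, "S" suspended


def suspend_for(jobs, partition):
    """Suspend the first running job from a lower tier; return it or None."""
    for job in jobs:
        if job.state == "R" and TIER[job.partition] < TIER[partition]:
            job.state = "S"
            return job
    return None


def busy_cores(jobs):
    """Only running jobs occupy a core (one core per job in this model)."""
    return sum(1 for job in jobs if job.state == "R")


def held_memory(jobs):
    """Suspended jobs stay resident, so they still count towards RAM."""
    return sum(job.mem_mb for job in jobs)


# Node hpc402 from the example above, with made-up memory figures.
hpc402 = [Job(1807, "uctlo", 2000), Job(1808, "uctlo", 2000), Job(1809, "uctlo", 2000)]
victim = suspend_for(hpc402, "ucthi")
print(victim.job_id, busy_cores(hpc402), held_memory(hpc402))  # 1807 2 6000
```

Suspending job 1807 frees a core, but the memory held on the node is unchanged, which is why a memory-hungry high priority job can still run into trouble.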
Here core sharing is disabled and PreemptMode is set to REQUEUE. User bob is running jobs on all available cores in the uctlo partition…
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1794   uctlo      MemTestB  bob   R   0:33  hpc401
1795   uctlo      MemTestB  bob   R   0:30  hpc401
1796   uctlo      MemTestB  bob   R   0:30  hpc401
1797   uctlo      MemTestB  bob   R   0:30  hpc402
1798   uctlo      MemTestB  bob   R   0:30  hpc402
1799   uctlo      MemTestB  bob   R   0:27  hpc402
User andy starts submitting jobs to the ucthi partition and bob’s jobs are cancelled…
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1794   uctlo      MemTestB  bob   CG  0:00  hpc401
1795   uctlo      MemTestB  bob   R   0:49  hpc401
1796   uctlo      MemTestB  bob   R   0:49  hpc401
1797   uctlo      MemTestB  bob   R   0:49  hpc402
1798   uctlo      MemTestB  bob   R   0:49  hpc402
1799   uctlo      MemTestB  bob   R   0:46  hpc402
1801   ucthi      MemTestA  andy  R   0:01  hpc401
Three of bob’s jobs have been cancelled and resubmitted in the pending state, while andy’s three jobs are now running…
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1794   uctlo      MemTestB  bob   PD  0:00  (BeginTime)
1795   uctlo      MemTestB  bob   PD  0:00  (BeginTime)
1796   uctlo      MemTestB  bob   PD  0:00  (BeginTime)
1797   uctlo      MemTestB  bob   R   0:59  hpc402
1798   uctlo      MemTestB  bob   R   0:59  hpc402
1799   uctlo      MemTestB  bob   R   0:56  hpc402
1801   ucthi      MemTestA  andy  R   0:11  hpc401
1802   ucthi      MemTestA  andy  R   0:06  hpc401
1803   ucthi      MemTestA  andy  R   0:02  hpc401
In this instance it is possible to oversubscribe the cores on a node, and hence memory limits need to be protected by setting memory as a consumable resource. With this scheduling method, jobs may still not run if memory is set as a consumable resource and there is insufficient RAM per core. A default RAM/core/job is set in the configuration file, but users can set their own requirements within the bounds of their account settings.
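To show why a job may not run even when a core is free, here is a minimal Python sketch of memory treated as a consumable resource. It is an illustration only: the Node class, the try_allocate function, the 2048 MB default and the node size are all assumptions, and account-level limits are left out.

```python
from dataclasses import dataclass

# Hypothetical site default, standing in for the RAM/core value in the config file.
DEFAULT_MEM_PER_CORE_MB = 2048


@dataclass
class Node:
    cores: int
    mem_mb: int
    used_cores: int = 0
    used_mem_mb: int = 0


def try_allocate(node, cores, mem_per_core_mb=DEFAULT_MEM_PER_CORE_MB):
    """Reserve cores and memory together; refuse if either would run out."""
    needed_mem = cores * mem_per_core_mb
    if node.used_cores + cores > node.cores:
        return False
    if node.used_mem_mb + needed_mem > node.mem_mb:
        return False
    node.used_cores += cores
    node.used_mem_mb += needed_mem
    return True


node = Node(cores=3, mem_mb=6144)
print(try_allocate(node, 1))                        # True, uses the default
print(try_allocate(node, 1, mem_per_core_mb=4096))  # True, user asked for more
print(try_allocate(node, 1))                        # False, a core is free but RAM is not
```

The third request is refused although one core is still idle, which is the situation described above: memory, not cores, becomes the limiting resource.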
Preemption via Gang scheduling:
Here users andy and bob have filled up all available cores and some of andy’s jobs are suspended.
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1815   ucthimem   MemTestB  bob   R   0:50  hpc406
1816   ucthimem   MemTestB  bob   R   0:47  hpc406
1817   ucthimem   MemTestB  bob   R   0:47  hpc407
1820   ucthimem   MemTestA  andy  R   0:19  hpc407
1821   ucthimem   MemTestA  andy  R   0:16  hpc408
1822   ucthimem   MemTestA  andy  R   0:16  hpc408
1823   ucthimem   MemTestA  andy  S   0:00  hpc406
1824   ucthimem   MemTestA  andy  S   0:00  hpc406
After a time slice has passed some of bob’s jobs are suspended and andy’s jobs run.
JOBID  PARTITION  NAME      USER  ST  TIME  NODELIST
1817   ucthimem   MemTestB  bob   R   0:47  hpc407
1820   ucthimem   MemTestA  andy  R   0:19  hpc407
1821   ucthimem   MemTestA  andy  R   0:16  hpc408
1822   ucthimem   MemTestA  andy  R   0:16  hpc408
1823   ucthimem   MemTestA  andy  R   0:00  hpc406
1824   ucthimem   MemTestA  andy  R   0:00  hpc406
1815   ucthimem   MemTestB  bob   S   0:50  hpc406
1816   ucthimem   MemTestB  bob   S   0:47  hpc406
This is repeated until all jobs are completed. The length of each time slice is set by SchedulerTimeSlice, which defaults to 30 seconds in SLURM. If user andy submits to a higher priority partition, this scheduling scheme reverts to standard job preemption and bob’s jobs are suspended until cores become free. Once again cores can be oversubscribed and memory needs to be protected.
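The alternation between running and suspended jobs can be sketched in a few lines of Python. This is a toy round-robin model and not SLURM's gang scheduler: it assumes every job needs exactly one core and that all jobs have equal priority. The function name gang_schedule is hypothetical.

```python
def gang_schedule(job_ids, cores, slices):
    """Return, for each time slice, a (running, suspended) pair of job ID lists."""
    order = list(job_ids)
    timeline = []
    for _ in range(slices):
        running, suspended = order[:cores], order[cores:]
        timeline.append((running, suspended))
        # Jobs that were suspended go to the front for the next slice.
        order = suspended + running
    return timeline


# Node hpc406 from the example above: two cores shared by four jobs.
for running, suspended in gang_schedule([1815, 1816, 1823, 1824], cores=2, slices=3):
    print("running:", running, "suspended:", suspended)
```

With two cores and four jobs the pairs simply swap places on every slice, matching the two listings above where jobs 1815 and 1816 run first and jobs 1823 and 1824 take over afterwards.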
Multifactor preemption:
This is the most complex form of preemption. Here the job priority is based upon a complex factoring algorithm:
Job_priority = (PriorityWeightAge) * (age_factor)
             + (PriorityWeightFairshare) * (fair-share_factor)
             + (PriorityWeightJobSize) * (job_size_factor)
             + (PriorityWeightPartition) * (partition_factor)
             + (PriorityWeightQOS) * (QOS_factor)
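As a worked illustration of the formula, the following Python sketch evaluates the weighted sum for two users when only the QOS weight is non-zero. The weight of 1000 and all factor values are invented for the example; each factor is taken to be a number between 0.0 and 1.0. The function and variable names are hypothetical.

```python
TERMS = ("age", "fairshare", "job_size", "partition", "qos")


def job_priority(weights, factors):
    """Weighted sum of the five factors, mirroring the formula above."""
    return sum(weights[term] * factors[term] for term in TERMS)


# Only QOS counts in this example; every other weight is zero.
weights = {"age": 0, "fairshare": 0, "job_size": 0, "partition": 0, "qos": 1000}

# Made-up factor values for the two users.
bob = {"age": 0.5, "fairshare": 0.3, "job_size": 0.1, "partition": 0.0, "qos": 0.0}
andy = {"age": 0.1, "fairshare": 0.3, "job_size": 0.1, "partition": 0.0, "qos": 1.0}

print(job_priority(weights, bob))   # 0.0
print(job_priority(weights, andy))  # 1000.0
```

Although bob's job has been waiting longer (a larger age factor), the zero weights mean that only QOS decides the outcome, so andy's job ranks higher.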
You are encouraged to read the official documentation on how this can be configured. Below is a simple example based only on user QOS for priority. User bob is consuming all available cores…
JOBID  PARTITION  NAME       USER  ST  TIME  NODELIST
1893   ucthi      CoreTestB  bob   R   0:09  hpc402
1892   ucthi      CoreTestB  bob   R   0:12  hpc402
1889   ucthi      CoreTestB  bob   R   0:15  hpc401
1890   ucthi      CoreTestB  bob   R   0:15  hpc401
1891   ucthi      CoreTestB  bob   R   0:15  hpc402
1888   ucthi      CoreTestB  bob   R   0:18  hpc401
User andy submits a job which cannot run immediately as there are no resources free. As andy’s job has a higher priority, enough of bob’s jobs are cancelled to allow it to run…
JOBID  PARTITION  NAME       USER  ST  TIME  NODELIST
1889   ucthi      CoreTestB  bob   CG  0:00  hpc401
1890   ucthi      CoreTestB  bob   CG  0:00  hpc401
1888   ucthi      CoreTestB  bob   CG  0:00  hpc401
1894   ucthi      CoreTestA  andy  PD  0:00  (Resources)
1893   ucthi      CoreTestB  bob   R   0:13  hpc402
1892   ucthi      CoreTestB  bob   R   0:16  hpc402
1891   ucthi      CoreTestB  bob   R   0:19  hpc402
Once andy’s job is running bob’s jobs are requeued…
JOBID  PARTITION  NAME       USER  ST  TIME  NODELIST
1888   ucthi      CoreTestB  bob   PD  0:00  (BeginTime)
1889   ucthi      CoreTestB  bob   PD  0:00  (BeginTime)
1890   ucthi      CoreTestB  bob   PD  0:00  (BeginTime)
1894   ucthi      CoreTestA  andy  R   0:03  hpc401
1893   ucthi      CoreTestB  bob   R   0:17  hpc402
1892   ucthi      CoreTestB  bob   R   0:20  hpc402
1891   ucthi      CoreTestB  bob   R   0:23  hpc402
After a short time the backfill scheduler allows one of bob’s requeued jobs to run. This is because andy’s job needs 2 nodes but bob’s jobs only need one node.
JOBID  PARTITION  NAME       USER  ST  TIME  NODELIST
1890   ucthi      CoreTestB  bob   PD  0:00  (Priority)
1889   ucthi      CoreTestB  bob   PD  0:00  (Resources)
1894   ucthi      CoreTestA  andy  R   2:41  hpc401
1888   ucthi      CoreTestB  bob   R   2:16  hpc401
1893   ucthi      CoreTestB  bob   R   2:55  hpc402
1892   ucthi      CoreTestB  bob   R   2:58  hpc402
1891   ucthi      CoreTestB  bob   R   3:01  hpc402
Unfortunately this preemption mode is not compatible with thread suspension.