Berkeley Lab Checkpoint/Restart (BLCR) has been installed on the SLURM cluster. It allows users to checkpoint a job, cancel it, and then resume it later from the checkpoint. The executable is started with the cr_run wrapper:
cr_run /home/andy/ram.pl >> /home/andy/ramtest.out
This trivial test job adds 100 KB of data to an array every second and writes the index of the array to the output file.
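The source of ram.pl is not shown here, so the following is only a rough sketch of an equivalent workload in POSIX shell. The function name grow_memory and its two parameters are hypothetical, and because POSIX sh has no arrays, a growing string buffer stands in for the Perl array.

```shell
# Hypothetical stand-in for ram.pl (the real script is not shown).
# Grows an in-memory buffer by 100 KB per step and prints the step index.
# Usage: grow_memory <steps> [delay_seconds]
grow_memory() (
    steps=$1
    delay=${2:-1}                     # default: one step per second
    chunk=$(printf '%102400s' '')     # 102400 spaces = 100 KB
    data=''
    i=0
    echo "starting at $(date)"
    while [ "$i" -lt "$steps" ]; do
        data="$data$chunk"            # memory footprint grows by ~100 KB
        echo "$i"                     # index of the chunk just added
        sleep "$delay"
        i=$((i + 1))
    done
)
```

Calling grow_memory 100 >> /home/andy/ramtest.out would produce a header line followed by the indices 0 to 99, one per line, which is the shape of output the checkpoint test relies on.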
Start the job:
~$ sbatch ramtest.sh
Submitted batch job 2180
~$ squeue
JOBID PARTITION NAME    USER ST TIME NODELIST
2180  ucthimem  MemTest andy R  0:05 hpc406
Create a checkpoint file at time=t1
~$ /opt/slurm/bin/scontrol checkpoint create 2180 ImageDir=/home/andy
Contents of output file at t1:
~$ cat ramtest.out
starting at Tue Jul 7 13:08:35 2015
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
A checkpoint folder has now been created:
~$ ls -l /home/andy/2180/
-r-------- 1 andy andy 5775884 Jul 7 15:09 script.ckpt
Cancel job at time=t2
~$ scancel 2180
Contents of output file at t2:
~$ cat ramtest.out
starting at Tue Jul 7 13:08:35 2015
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
And the job has been stopped:
~$ cat slurm-2180.out
slurmstepd: *** JOB 2180 CANCELLED AT 2015-07-07T15:12:05 *** on srvcnthpc406
Restart the job at time=t3
~$ /opt/slurm/bin/scontrol checkpoint restart 2180 ImageDir=/home/alewis
scontrol_checkpoint error: Duplicate job id
This command failed because the scheduler keeps a short list of recently finished JobIDs and still considers job 2180 known. You have to wait at least 15 minutes for the scheduler to ‘forget’ the job before the restart is accepted.
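Rather than retrying by hand, the wait can be scripted. The following is a minimal sketch of a generic retry helper, not part of SLURM or BLCR; the function name retry_restart and its parameters are hypothetical, and it assumes the wrapped command returns a non-zero exit status when it fails.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_restart <max_attempts> <delay_seconds> <command> [args...]
retry_restart() (
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            exit 0                    # command succeeded, stop retrying
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"            # wait before the next attempt
        fi
        attempt=$((attempt + 1))
    done
    exit 1                            # every attempt failed
)

# Example (assumed values: up to 6 attempts, 5 minutes apart):
#   retry_restart 6 300 /opt/slurm/bin/scontrol checkpoint restart 2180 ImageDir=/home/alewis
```

The helper returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can be chained with && in a larger script.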
Restart the job at time=t4
~$ /opt/slurm/bin/scontrol checkpoint restart 2180 ImageDir=/home/alewis
The job is restarted in the pending state:
alewis@srvcnthpc501:~$ squeue
JOBID PARTITION NAME    USER ST TIME NODELIST
2180  ucthimem  MemTest andy PD 0:00 (None)
After several seconds the job starts running:
alewis@srvcnthpc501:~$ squeue
JOBID PARTITION NAME    USER ST TIME NODELIST
2180  ucthimem  MemTest andy R  0:04 hpc406
The job's output file is back to its state at t1, when the checkpoint was taken:
~$ cat ramtest.out
starting at Tue Jul 7 13:08:35 2015
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18