Worker job submission
Most clusters allow jobs to be submitted from inside other jobs. Until recently this was disabled on hex, but we have now enabled it on the series600 nodes. You can now call qsub from inside your job script, although…
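As a rough sketch, a PBS job script that submits a follow-up job from inside itself might look like the following (the script names and resource requests are hypothetical, for illustration only):

```shell
#!/bin/bash
#PBS -N parent-job
#PBS -l nodes=1:ppn=1

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Do this job's own work first (hypothetical script).
./stage-one.sh

# Then submit a dependent follow-up job from inside this one.
# This only works where in-job submission is enabled,
# e.g. on the series600 nodes.
qsub stage-two.sh
```

This pattern is useful for pipelines where one stage decides, at run time, what the next stage should be.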
Say hello to HAL
Over the past few months we’ve been referring to a new type of scheduler that we’ve been testing. We decided to move away from PBS as MAUI is no longer maintained. The scheduler we have selected, SLURM, is …
Berkeley Lab Checkpoint/Restart
Berkeley Lab Checkpoint/Restart (BLCR) has been installed on the SLURM cluster. It allows users to checkpoint a job, cancel it, and resume it at a later date. The executable is started with the cr_run wrapper: cr_run /home/andy/ram.pl >> /home/andy/ramtest.out…
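As a sketch of the full checkpoint/restart cycle, the BLCR command-line tools can be used along these lines (the context filename is hypothetical; this assumes BLCR's kernel modules are loaded on the node):

```shell
# Start the program under BLCR's wrapper, as in the example above.
cr_run /home/andy/ram.pl >> /home/andy/ramtest.out &
PID=$!

# Later: checkpoint the running process to a context file.
# -f names the output file; --term stops the process afterwards.
cr_checkpoint -f ram.context --term $PID

# At a later date: resume the job from the saved context file.
cr_restart ram.context
```

Note that cr_run works by preloading the BLCR library, so only dynamically linked executables can be checkpointed this way.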
SLURM and memory management
SLURM allows a DefMemPerCPU and a MaxMemPerCPU to be set. If a user does not set a memory limit, the default is used; this is normally set to MaxMem/NumCores. As memory is a consumable resource (SelectTypeParameters=CR_Core_Memory), MaxMemPerCPU serves not
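For illustration, the relevant slurm.conf settings might look like this (the figures are hypothetical, for a node with 64 GB of RAM and 16 cores, so MaxMem/NumCores ≈ 4000 MB):

```shell
# slurm.conf fragment (hypothetical values)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # treat memory as a consumable resource
DefMemPerCPU=4000                     # MB; applied when a job sets no limit
MaxMemPerCPU=4000                     # MB; per-core ceiling a job may request
```

With these settings a job asking for more memory per core than MaxMemPerCPU will have its core allocation scaled up to match, rather than being allowed to oversubscribe the node's RAM.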
…SLURM job preemption
SLURM provides a preemption mechanism to deal with situations where the cluster becomes overloaded. This can be configured in several ways:
FIFO:
This is the simplest method of queueing: there is no preemption; jobs come in, queue and …
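As a hedged sketch, partition-priority preemption (one of the alternatives to plain FIFO) could be configured in slurm.conf along these lines; the partition names, priorities and node list here are hypothetical:

```shell
# slurm.conf fragment (hypothetical partitions)
PreemptType=preempt/partition_prio   # higher-priority partitions preempt lower
PreemptMode=REQUEUE                  # preempted jobs are requeued, not killed
PartitionName=batch  Priority=1  Nodes=node[01-16] Default=YES
PartitionName=urgent Priority=10 Nodes=node[01-16]
```

With this in place, a job submitted to the urgent partition can displace running batch jobs, which go back into the queue to be rescheduled.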
Where are all the HPC servers disappearing to?
Those users still making use of hpc.uct.ac.za will have noticed that a few worker nodes have vanished. As mentioned previously, we’re investigating a new scheduler, SLURM (the Simple Linux Utility for Resource Management). SLURM is a very different animal to
…