It’s been a while since the HPC cluster had a major update. Over the last few weeks we’ve been planning and building a new test environment. One of the major issues on the horizon is the impending demise of CentOS. Free, stable, robust and well supported, CentOS was the OS of choice for many institutes and companies. We’ve spent the last year watching the development of its successor, Rocky Linux, and given the strength of its community support and development we have decided to base our new test cluster on this OS.
We’ll be keeping SLURM as our scheduler, as it is popular, well supported and reliable. Our installed version has fallen considerably behind, and the latest release addresses several bugs.
In a recent test we submitted just over 60,000 jobs in 10 minutes. The jobs were real-world MPI tasks submitted by two test users to a variety of partitions, with widely differing wall times in order to stress-test the backfill algorithm. Neither the scheduler nor the backend database had any difficulty handling this load.
[Figure: Submitted jobs]
[Figure: Backfill scheduler workload]
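For readers who want to try something similar, the sketch below shows one way a batch of submissions with randomised wall times could be generated. It is not the script we used. The partition names, the job script name and the wall time range are placeholders chosen for illustration, and the code only builds the sbatch command lines rather than running them.

```python
import random
import shlex

# Hypothetical values for illustration; replace with your site's own.
PARTITIONS = ["short", "long", "highmem"]
JOB_SCRIPT = "mpi_job.sh"


def format_walltime(minutes):
    """Format a number of minutes as D-HH:MM:SS for sbatch --time."""
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}-{hours:02d}:{mins:02d}:00"


def build_submissions(n_jobs, seed=0, min_minutes=1, max_minutes=2880):
    """Return a list of sbatch argument lists with varied wall times.

    Wall times are drawn uniformly between min_minutes and max_minutes so
    that the backfill scheduler sees a mix of short and long jobs.
    """
    rng = random.Random(seed)  # seeded so a test run can be repeated
    commands = []
    for _ in range(n_jobs):
        partition = rng.choice(PARTITIONS)
        minutes = rng.randint(min_minutes, max_minutes)
        ntasks = rng.choice([2, 4, 8, 16])
        commands.append([
            "sbatch",
            f"--partition={partition}",
            f"--time={format_walltime(minutes)}",
            f"--ntasks={ntasks}",
            JOB_SCRIPT,
        ])
    return commands


if __name__ == "__main__":
    # Print a handful of commands for inspection instead of submitting them.
    for cmd in build_submissions(5):
        print(shlex.join(cmd))
```

To submit for real, each argument list could be passed to subprocess.run on a login node. Start with a small n_jobs and check with your administrators before loading a production scheduler.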
In addition to the OS and scheduler, the monitoring and reporting tools will be streamlined and updated.
We do not yet have a timeline for the upgrade; however, we’re looking at implementing the new cluster some time in 2023.