What follows is a critical report on the cluster upgrade. Our plan was to upgrade the operating system on all cluster servers and to bring the FHGFS file system up to the latest supported release, BGFS. We also planned to upgrade to the latest Mellanox and CUDA drivers for our InfiniBand and GPU cards respectively. The cluster was taken offline on Monday morning at 09:00 and was back up by Tuesday afternoon.
What went right
– All servers except node 600 were upgraded from SLES 11 SP3 to SP4.
– All InfiniBand and GPU drivers were upgraded.
What went wrong
– During the upgrade, worker node 600 hit an error that damaged its operating system. The node is being reinstalled.
– Several minor issues came up while upgrading the InfiniBand drivers, but these were resolved.
– The FHGFS upgrade did not go as planned. The new release (BGFS) no longer includes several critical features that we require, and the SLES client daemon failed to compile. We have reverted to the older version (FHGFS), but at a slightly higher patch level.
We have added the latest version of OpenMPI to the cluster. The natively installed version of OpenMPI (1.6.5) is no longer available in SP4. This is actually a good thing, as 1.6.5 is ancient; however, it means that users’ MPI jobs will fail unless their scripts or profiles are modified. To get around this, users need to specify the version of MPI they need by placing one of the following lines in their .bashrc file:
module add mpi/openmpi-1.8.8
module add mpi/openmpi-1.10.1
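For anyone who shares a .bashrc between this cluster and other machines, a slightly more defensive variant may be useful. The sketch below is only an illustration: the helper name load_mpi is made up for this example and is not something the cluster provides. It assumes the two module names listed above and only calls module when that command exists in the current shell.

```shell
# Hypothetical ~/.bashrc fragment. load_mpi is an illustrative helper name,
# not a command supplied by the cluster.
load_mpi() {
    # Default to the newer of the two versions listed above if no argument is given.
    mpi_module="${1:-mpi/openmpi-1.10.1}"

    # Only call `module` when it is defined in this shell, so the same
    # .bashrc can be sourced on machines without environment modules.
    if command -v module >/dev/null 2>&1; then
        module add "$mpi_module"
    else
        echo "module command not found; skipping $mpi_module" >&2
        return 1
    fi
}

# Load exactly one version; change the argument to mpi/openmpi-1.8.8 if needed.
load_mpi mpi/openmpi-1.10.1 || true
```

After logging in again, running module list should show which OpenMPI module is active.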