This weekend the head node rebooted unexpectedly. We're still not sure what caused it. However, it looks as if the running jobs were not affected.…
We're currently investigating a memory issue with some of the worker nodes. Memory is not being freed up after jobs complete.
UPDATE - 4 July:
Turns out it's not a memory error. The problem is the way that net-snmp monitors…
Patched kernels on the HPC servers to 2.6.18-238.1.1.el5. All went fine except for the head node, which has an issue with the latest kernel (it dies at boot with a kernel panic), so we are booting it into the older version 2.6.18-194.1.1.el5 until we can sort…