On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. All the servers in the four rightmost columns of every enclosure lost power. We're still not sure exactly how this happened. Fortunately, all the non-HPC production services failed over to other active servers.
We lost 9 jobs and about 500 hours of computational work. Our monitoring system reported the fault, and within about 15 minutes we were reconfiguring the cluster to absorb the damage. By 9am the next morning the servers were restarted, and by lunchtime we felt confident enough to bring them back into the cluster. Strangely, PBS automatically restarted 3 of the failed jobs, which then ran successfully.
Then at 5:05 on Thursday morning an individual node suffered an on-board power regulator failure. This had nothing to do with Monday's failure, as it happened in a different data centre, but it was just as frustrating: we lost 3 jobs and about 180 hours of computational work. At 13:00 today we replaced the power regulator, and the node is now back in the cluster.