<?xml version="1.0"?>
<oembed><version>1.0</version><provider_name>UCT HPC</provider_name><provider_url>https://ucthpc.uct.ac.za</provider_url><author_name>Andrew Lewis</author_name><author_url>https://ucthpc.uct.ac.za/index.php/author/andrew-lewis/</author_url><title>Rough week - UCT HPC</title><type>rich</type><width>600</width><height>338</height><html>&lt;blockquote class="wp-embedded-content" data-secret="DYRU3aDtBI"&gt;&lt;a href="https://ucthpc.uct.ac.za/index.php/2012/08/17/rough-week/"&gt;Rough week&lt;/a&gt;&lt;/blockquote&gt;&lt;iframe sandbox="allow-scripts" security="restricted" src="https://ucthpc.uct.ac.za/index.php/2012/08/17/rough-week/embed/#?secret=DYRU3aDtBI" width="600" height="338" title="&#x201C;Rough week&#x201D; &#x2014; UCT HPC" data-secret="DYRU3aDtBI" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" class="wp-embedded-content"&gt;&lt;/iframe&gt;&lt;script type="text/javascript"&gt;
/*! This file is auto-generated */
!function(c,d){"use strict";var e=!1,o=!1;if(d.querySelector)if(c.addEventListener)e=!0;if(c.wp=c.wp||{},c.wp.receiveEmbedMessage);else if(c.wp.receiveEmbedMessage=function(e){var t=e.data;if(!t);else if(!(t.secret||t.message||t.value));else if(/[^a-zA-Z0-9]/.test(t.secret));else{for(var r,s,a,i=d.querySelectorAll('iframe[data-secret="'+t.secret+'"]'),n=d.querySelectorAll('blockquote[data-secret="'+t.secret+'"]'),o=new RegExp("^https?:$","i"),l=0;l&lt;n.length;l++)n[l].style.display="none";for(l=0;l&lt;i.length;l++)if(r=i[l],e.source!==r.contentWindow);else{if(r.removeAttribute("style"),"height"===t.message){if(1e3&lt;(s=parseInt(t.value,10)))s=1e3;else if(~~s&lt;200)s=200;r.height=s}if("link"===t.message)if(s=d.createElement("a"),a=d.createElement("a"),s.href=r.getAttribute("src"),a.href=t.value,!o.test(a.protocol));else if(a.host===s.host)if(d.activeElement===r)c.top.location.href=t.value}}},e)c.addEventListener("message",c.wp.receiveEmbedMessage,!1),d.addEventListener("DOMContentLoaded",t,!1),c.addEventListener("load",t,!1);function t(){if(o);else{o=!0;for(var e,t,r,s=-1!==navigator.appVersion.indexOf("MSIE 10"),a=!!navigator.userAgent.match(/Trident.*rv:11\./),i=d.querySelectorAll("iframe.wp-embedded-content"),n=0;n&lt;i.length;n++){if(!(r=(t=i[n]).getAttribute("data-secret")))r=Math.random().toString(36).substr(2,10),t.src+="#?secret="+r,t.setAttribute("data-secret",r);if(s||a)(e=t.cloneNode(!0)).removeAttribute("security"),t.parentNode.replaceChild(e,t);t.contentWindow.postMessage({message:"ready",secret:r},"*")}}}}(window,document);
&lt;/script&gt;
</html><description>On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. All the servers in the four rightmost columns of every enclosure lost power. We're still not sure exactly how this happened. Fortunately, all the non-HPC production services failed over to other active servers. We lost 9 jobs and about 500 computational hours of work. Our monitoring system reported the fault, and within about 15 minutes we were working on reconfiguring the cluster to absorb the damage. By 9am the next morning the servers had been restarted, and by lunchtime we felt confident enough to bring them back into the cluster. Strangely, PBS automatically restarted 3 of the failed jobs, which then ran successfully. Then at 5:05 on Thursday morning an individual node suffered an on-board power regulator failure. This had nothing to do with Monday's failure, as it happened in a different data centre; however, it was just as frustrating, as we lost 3 jobs and about 180 hours of computing work. At 13:00 today we replaced the power regulator and the node is now back in the cluster.</description></oembed>
