{"version":"1.0","provider_name":"UCT HPC","provider_url":"https:\/\/ucthpc.uct.ac.za","author_name":"Andrew Lewis","author_url":"https:\/\/ucthpc.uct.ac.za\/index.php\/author\/andrew-lewis\/","title":"Rough week - UCT HPC","type":"rich","width":600,"height":338,"html":"<blockquote class=\"wp-embedded-content\" data-secret=\"rlcDz6Tj7y\"><a href=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\">Rough week<\/a><\/blockquote><iframe sandbox=\"allow-scripts\" security=\"restricted\" src=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/embed\/#?secret=rlcDz6Tj7y\" width=\"600\" height=\"338\" title=\"&#8220;Rough week&#8221; &#8212; UCT HPC\" data-secret=\"rlcDz6Tj7y\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" class=\"wp-embedded-content\"><\/iframe><script type=\"text\/javascript\">\n\/*! This file is auto-generated *\/\n!function(c,d){\"use strict\";var e=!1,o=!1;if(d.querySelector)if(c.addEventListener)e=!0;if(c.wp=c.wp||{},c.wp.receiveEmbedMessage);else if(c.wp.receiveEmbedMessage=function(e){var t=e.data;if(!t);else if(!(t.secret||t.message||t.value));else if(\/[^a-zA-Z0-9]\/.test(t.secret));else{for(var r,s,a,i=d.querySelectorAll('iframe[data-secret=\"'+t.secret+'\"]'),n=d.querySelectorAll('blockquote[data-secret=\"'+t.secret+'\"]'),o=new RegExp(\"^https?:$\",\"i\"),l=0;l<n.length;l++)n[l].style.display=\"none\";for(l=0;l<i.length;l++)if(r=i[l],e.source!==r.contentWindow);else{if(r.removeAttribute(\"style\"),\"height\"===t.message){if(1e3<(s=parseInt(t.value,10)))s=1e3;else if(~~s<200)s=200;r.height=s}if(\"link\"===t.message)if(s=d.createElement(\"a\"),a=d.createElement(\"a\"),s.href=r.getAttribute(\"src\"),a.href=t.value,!o.test(a.protocol));else if(a.host===s.host)if(d.activeElement===r)c.top.location.href=t.value}}},e)c.addEventListener(\"message\",c.wp.receiveEmbedMessage,!1),d.addEventListener(\"DOMContentLoaded\",t,!1),c.addEventListener(\"load\",t,!1);function t(){if(o);else{o=!0;for(var 
e,t,r,s=-1!==navigator.appVersion.indexOf(\"MSIE 10\"),a=!!navigator.userAgent.match(\/Trident.*rv:11\\.\/),i=d.querySelectorAll(\"iframe.wp-embedded-content\"),n=0;n<i.length;n++){if(!(r=(t=i[n]).getAttribute(\"data-secret\")))r=Math.random().toString(36).substr(2,10),t.src+=\"#?secret=\"+r,t.setAttribute(\"data-secret\",r);if(s||a)(e=t.cloneNode(!0)).removeAttribute(\"security\"),t.parentNode.replaceChild(e,t);t.contentWindow.postMessage({message:\"ready\",secret:r},\"*\")}}}}(window,document);\n<\/script>\n","description":"On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. All the servers in the 4 rightmost columns in all enclosures lost power. We're still not sure exactly how this happened. Fortunately all the non-HPC production services failed over to other active servers. We lost 9 jobs and about 500 computational hours' worth of work. Our monitoring system reported the fault, and within about 15 minutes we were working on reconfiguring the cluster to absorb the damage. By 9am the next morning the servers were restarted and by lunchtime we felt confident enough to bring them back into the cluster. Strangely, PBS automatically restarted 3 of the failed jobs, which then ran successfully. Then at 5:05 on Thursday morning an individual node suffered an on-board power regulator failure. This had nothing to do with Monday's failure as it happened in a different data centre; however, it was just as frustrating as we lost 3 jobs and about 180 hours of computing work. At 13:00 today we replaced the power regulator and the node is now back in the cluster."}