{"id":927,"date":"2012-08-17T14:22:29","date_gmt":"2012-08-17T12:22:29","guid":{"rendered":"http:\/\/oldblogs.uct.ac.za\/blog\/big-bytes\/2012\/08\/17\/rough-week"},"modified":"2015-08-14T13:07:55","modified_gmt":"2015-08-14T11:07:55","slug":"rough-week","status":"publish","type":"post","link":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/","title":{"rendered":"Rough week"},"content":{"rendered":"<div>On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. \u00a0All the servers in the four rightmost columns of every enclosure lost power. \u00a0We're still not sure exactly how this happened. Fortunately, all the non-HPC production services failed over to other active servers.<\/div>\r\n<div>We lost 9 jobs and about 500 compute hours of work. \u00a0Our monitoring system reported the fault, and within about 15 minutes we were reconfiguring the cluster to absorb the damage. \u00a0By 9am the next morning the servers had been restarted, and by lunchtime we felt confident enough to bring them back into the cluster. \u00a0Strangely, PBS automatically restarted 3 of the failed jobs, which then ran successfully.<\/div>\r\n<div>Then, at 05:05 on Thursday morning, an individual node suffered an on-board power regulator failure. \u00a0This had nothing to do with Monday's failure, as it happened in a different data centre. \u00a0It was just as frustrating, however, as we lost 3 jobs and about 180 hours of computing work. \u00a0At 13:00 today we replaced the power regulator, and the node is now back in the cluster.<\/div>","protected":false},"excerpt":{"rendered":"<div>On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. &nbsp;All the servers in the four rightmost columns of every enclosure lost power. &nbsp;We&#8217;re still not sure exactly how this happened. 
Fortunately all the non-HPC production services failed over to other active servers.<\/div>\n<div>We lost 9 jobs and about 500 computational hours worth of work. &nbsp;Our monitoring system reported the fault and we were working on reconfiguring the cluster to absorb the damage within about 15 minutes. &nbsp;By 9am the next morning the servers were restarted and by lunch time we felt confident enough to bring them back into the cluster. &nbsp;Strangely PBS automatically restarted 3 of the failed jobs which then ran successfully.<\/div>\n<div>Then at 5:05 on Thursday morning an individual node suffered an on-board power regulator failure. &nbsp;This had nothing to do with Monday&#8217;s failure as it happened in a different data centre, however it was just&nbsp;as frustrating as we lost 3 jobs and about 180 hours of computing work. &nbsp;At 13:00 today we replaced the power regulator and the node is now back in the cluster.<\/div>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[4],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Rough week - UCT HPC<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Rough week - UCT HPC\" \/>\n<meta property=\"og:description\" content=\"On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. &nbsp;All the servers in the 4 right most columns in all enclosures lost power. &nbsp;We&#039;re still not sure exactly how this happened. 
Fortunately all the non-HPC production services failed over to other active servers.We lost 9 jobs and about 500 computational hours worth of work. &nbsp;Our monitoring system reported the fault and we were working on reconfiguring the cluster to absorb the damage within about 15 minutes. &nbsp;By 9am the next morning the servers were restarted and by lunch time we felt confident enough to bring them back into the cluster. &nbsp;Strangely PBS automatically restarted 3 of the failed jobs which then ran successfully.Then at 5:05 on Thursday morning an individual node suffered an on-board power regulator failure. &nbsp;This had nothing to do with Monday&#039;s failure as it happened in a different data centre, however it was just&nbsp;as frustrating as we lost 3 jobs and about 180 hours of computing work. &nbsp;At 13:00 today we replaced the power regulator and the node is now back in the cluster.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\" \/>\n<meta property=\"og:site_name\" content=\"UCT HPC\" \/>\n<meta property=\"article:published_time\" content=\"2012-08-17T12:22:29+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2015-08-14T11:07:55+00:00\" \/>\n<meta name=\"author\" content=\"Andrew Lewis\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Andrew Lewis\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\"},\"author\":{\"name\":\"Andrew Lewis\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/person\/c183ad1c0a1063124a72d63963ae9c7e\"},\"headline\":\"Rough week\",\"datePublished\":\"2012-08-17T12:22:29+00:00\",\"dateModified\":\"2015-08-14T11:07:55+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\"},\"wordCount\":182,\"publisher\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#organization\"},\"articleSection\":[\"hpc\"],\"inLanguage\":\"en-ZA\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\",\"name\":\"Rough week - UCT HPC\",\"isPartOf\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#website\"},\"datePublished\":\"2012-08-17T12:22:29+00:00\",\"dateModified\":\"2015-08-14T11:07:55+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/#breadcrumb\"},\"inLanguage\":\"en-ZA\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ucthpc.uct.ac.za\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Rough week\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#website\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/\",\"name\":\"UCT HPC\",\"description\":\"University 
of Cape Town High Performance Computing\",\"publisher\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ucthpc.uct.ac.za\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-ZA\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#organization\",\"name\":\"University of Cape Town High Performance Computing\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-ZA\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png\",\"contentUrl\":\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png\",\"width\":450,\"height\":423,\"caption\":\"University of Cape Town High Performance Computing\"},\"image\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/person\/c183ad1c0a1063124a72d63963ae9c7e\",\"name\":\"Andrew Lewis\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-ZA\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/9652c9c73beeab594b8dc2383a880048?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/9652c9c73beeab594b8dc2383a880048?s=96&d=mm&r=g\",\"caption\":\"Andrew Lewis\"},\"sameAs\":[\"http:\/\/blogs.uct.ac.za\/blog\/big-bytes\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Rough week - UCT HPC","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/","og_locale":"en_US","og_type":"article","og_title":"Rough week - UCT HPC","og_description":"On Monday the 13th of August at about 19:10 we had a power failure in one of our racks. &nbsp;All the servers in the 4 right most columns in all enclosures lost power. &nbsp;We're still not sure exactly how this happened. Fortunately all the non-HPC production services failed over to other active servers.We lost 9 jobs and about 500 computational hours worth of work. &nbsp;Our monitoring system reported the fault and we were working on reconfiguring the cluster to absorb the damage within about 15 minutes. &nbsp;By 9am the next morning the servers were restarted and by lunch time we felt confident enough to bring them back into the cluster. &nbsp;Strangely PBS automatically restarted 3 of the failed jobs which then ran successfully.Then at 5:05 on Thursday morning an individual node suffered an on-board power regulator failure. &nbsp;This had nothing to do with Monday's failure as it happened in a different data centre, however it was just&nbsp;as frustrating as we lost 3 jobs and about 180 hours of computing work. &nbsp;At 13:00 today we replaced the power regulator and the node is now back in the cluster.","og_url":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/","og_site_name":"UCT HPC","article_published_time":"2012-08-17T12:22:29+00:00","article_modified_time":"2015-08-14T11:07:55+00:00","author":"Andrew Lewis","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Andrew Lewis","Est. 
reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/#article","isPartOf":{"@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/"},"author":{"name":"Andrew Lewis","@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/person\/c183ad1c0a1063124a72d63963ae9c7e"},"headline":"Rough week","datePublished":"2012-08-17T12:22:29+00:00","dateModified":"2015-08-14T11:07:55+00:00","mainEntityOfPage":{"@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/"},"wordCount":182,"publisher":{"@id":"https:\/\/ucthpc.uct.ac.za\/#organization"},"articleSection":["hpc"],"inLanguage":"en-ZA"},{"@type":"WebPage","@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/","url":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/","name":"Rough week - UCT HPC","isPartOf":{"@id":"https:\/\/ucthpc.uct.ac.za\/#website"},"datePublished":"2012-08-17T12:22:29+00:00","dateModified":"2015-08-14T11:07:55+00:00","breadcrumb":{"@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/#breadcrumb"},"inLanguage":"en-ZA","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/2012\/08\/17\/rough-week\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ucthpc.uct.ac.za\/"},{"@type":"ListItem","position":2,"name":"Rough week"}]},{"@type":"WebSite","@id":"https:\/\/ucthpc.uct.ac.za\/#website","url":"https:\/\/ucthpc.uct.ac.za\/","name":"UCT HPC","description":"University of Cape Town High Performance Computing","publisher":{"@id":"https:\/\/ucthpc.uct.ac.za\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ucthpc.uct.ac.za\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-ZA"},{"@type":"Organization","@id":"https:\/\/ucthpc.uct.ac.za\/#organization","name":"University of Cape Town High Performance Computing","url":"https:\/\/ucthpc.uct.ac.za\/","logo":{"@type":"ImageObject","inLanguage":"en-ZA","@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/","url":"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png","contentUrl":"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png","width":450,"height":423,"caption":"University of Cape Town High Performance Computing"},"image":{"@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/person\/c183ad1c0a1063124a72d63963ae9c7e","name":"Andrew Lewis","image":{"@type":"ImageObject","inLanguage":"en-ZA","@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/9652c9c73beeab594b8dc2383a880048?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/9652c9c73beeab594b8dc2383a880048?s=96&d=mm&r=g","caption":"Andrew 
Lewis"},"sameAs":["http:\/\/blogs.uct.ac.za\/blog\/big-bytes"]}]}},"_links":{"self":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/posts\/927"}],"collection":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/comments?post=927"}],"version-history":[{"count":2,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/posts\/927\/revisions"}],"predecessor-version":[{"id":2151,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/posts\/927\/revisions\/2151"}],"wp:attachment":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/media?parent=927"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/categories?post=927"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/tags?post=927"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}