{"id":4453,"date":"2022-11-18T17:29:49","date_gmt":"2022-11-18T15:29:49","guid":{"rendered":"https:\/\/ucthpc.uct.ac.za\/?page_id=4453"},"modified":"2023-06-08T14:37:27","modified_gmt":"2023-06-08T12:37:27","slug":"the-hpc-dashboard","status":"publish","type":"page","link":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/","title":{"rendered":"The HPC Dashboard"},"content":{"rendered":"<p>There are several ways to interrogate the cluster and view its general status. Command line tools such as qstat, cores and sinfo report it, but only as text. The <a href=\"http:\/\/hpc.uct.ac.za\/db\/\" target=\"_blank\" rel=\"noopener\">dashboard<\/a> provides a single visual interface that lets you scan every core on the cluster at a glance and see how busy it is. The only drawback is that this view looks a bit like an explosion in a pixel factory, so it takes some interpretation.<\/p>\n<p>Below is a high-level view of the dashboard page. It is split into five sections:<\/p>\n<p>1 &#8211; <a href=\"#nodes\">Nodes<\/a><br \/>\n2 &#8211; <a href=\"#general\">General<\/a><br \/>\n3 &#8211; <a href=\"#jobs\">Jobs<\/a><br \/>\n4 &#8211; <a href=\"#clusterstatus\">Cluster status<\/a><br \/>\n5 &#8211; <a href=\"#workernode\">Worker node status<\/a><\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-4454\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-175x300.jpg\" alt=\"\" width=\"175\" height=\"300\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-175x300.jpg 175w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-596x1024.jpg 596w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-600x1031.jpg 600w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-768x1319.jpg 768w, 
https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview.jpg 794w\" sizes=\"(max-width: 175px) 100vw, 175px\" \/><\/a><a name=\"nodes\"><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><strong>1 &#8211; Nodes<\/strong><\/p>\n<p>This dense graphical view shows every node and core on the cluster, grouped by partition. Each block is a worker node (server). Each vertical bar is a core, and the last vertical bar in each block shows the node&#8217;s RAM:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-4461 size-full\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/basic.jpg\" alt=\"\" width=\"382\" height=\"133\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/basic.jpg 382w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/basic-300x104.jpg 300w\" sizes=\"(max-width: 382px) 100vw, 382px\" \/><\/p>\n<p>On the left is an example of a typical worker node in use. All 40 cores are reserved. The processes running on it are &#8220;bursty&#8221;: right now they are using the cores at only about 40% on average, but during the previous polling cycle the load average was about 90%. The RAM bar shows the amount of RAM remaining, in this case 90%.<\/p>\n<p>&nbsp;<\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/problem.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignright wp-image-4462 size-full\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/problem.jpg\" alt=\"\" width=\"500\" height=\"201\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/problem.jpg 500w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/problem-300x121.jpg 300w\" sizes=\"(max-width: 500px) 100vw, 500px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>On the right is an example of a GPU node that is registering a problem. The node background is faded red to indicate an issue; in this case, more cores and GPU cards are in use than have been reserved.\u00a0
The administrator has placed this node into draining mode while the issue is dealt with. This kind of over-allocation typically happens when a job has not terminated correctly.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/down.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-4463 size-full\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/down.jpg\" alt=\"\" width=\"138\" height=\"67\" \/><\/a>This node is marked as &#8216;down&#8217; because it is powered off, its SLURM daemon is not running, or it cannot be reached on the network.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2023\/06\/dormant.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-4529 alignleft\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2023\/06\/dormant.jpg\" alt=\"\" width=\"125\" height=\"66\" \/><\/a>This node&#8217;s core and memory graphs are faded to indicate that it has probably crashed and is no longer reporting core and memory values. The values shown are the last ones the head node received from it.<\/p>\n<p>&nbsp;<\/p>\n<p><a name=\"general\"><\/a><br \/>\n<strong>2 &#8211; General<\/strong><\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/general.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-4466 size-full alignleft\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/general.jpg\" alt=\"\" width=\"700\" height=\"53\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/general.jpg 700w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/general-300x23.jpg 300w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/general-600x45.jpg 600w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/a>This section shows the general health of the cluster and contains links to 
various pages with more detailed information. On the left is a bar graph showing the remaining space on the shared disks.<\/p>\n<p>On the right are the logged-in users, the CPU load on the head node and the available RAM. Below these are the number of CPU hours currently being computed and the numbers of running and queued jobs.<br \/>\n<a name=\"jobs\"><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><strong>3 &#8211; Jobs<\/strong><\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/jobs.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-4471 size-medium alignleft\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/jobs-300x143.jpg\" alt=\"\" width=\"300\" height=\"143\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/jobs-300x143.jpg 300w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/jobs-600x285.jpg 600w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/jobs-768x365.jpg 768w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/jobs.jpg 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a>This section lists the jobs that are running or queued. The Q\\R-TIME column shows how long each job has been running or, if it is still queued, how long it has been waiting. Queued jobs are shown in grey. The CPUTIME column is the product of the TIME column and the number of reserved cores in the CORES column. The list is sorted by priority.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>The longer a job remains queued, the higher its priority becomes and the greater its chance of running. A job can be queued for many reasons; some of the more common ones are listed below:<\/p>\n<pre><span style=\"color: #666666;\">Resources - There are not enough free cores or nodes to run your job.\r\nAssocGrpCpuLimit - You have reached your core limit; some of your currently running jobs need to end first.\r\nAssocMaxJobsLimit - You have reached your job limit; some of your currently running jobs need to end first. 
This limit is normally applied to a group of users, which means that other users&#8217; jobs in the group count against your limit.\r\nMaxWallTimePerJob - You have requested a wall time that exceeds the limit for your account or the partition.\r\nPriority - Your job could run, but another job that has been waiting longer has priority for the same resources.<\/span><\/pre>\n<p><a name=\"clusterstatus\"><\/a><\/p>\n<p><strong>4 &#8211; Cluster status<\/strong><br \/>\nThis section shows the default partition limits and the state of the nodes in each partition. The most important values here are the default wall time and the maximum memory per CPU. A partition can appear more than once because its nodes may be in several different states.<br \/>\n<a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/status.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-4476\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/status-300x241.jpg\" alt=\"\" width=\"300\" height=\"241\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/status-300x241.jpg 300w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/status-600x482.jpg 600w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/status.jpg 700w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p><a name=\"workernode\"><\/a><\/p>\n<p><strong>5 &#8211; Worker node status<\/strong><\/p>\n<p>This section shows the current state of the worker nodes and the jobs running on them.<\/p>\n<p><a ref=\"magnificPopup\" href=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/wn.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-medium wp-image-4478\" src=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/wn-300x80.jpg\" alt=\"\" width=\"300\" height=\"80\" srcset=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/wn-300x80.jpg 300w, 
https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/wn-600x161.jpg 600w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/wn-768x206.jpg 768w, https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/wn.jpg 870w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>For a GPU node, the resource section also lists the available GPU instances.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There are several ways to interrogate the cluster to view its general status. There are command line tools such as qstat, cores and sinfo, however those are all text based. The dashboard provides a single visual interface that allows one to scan every single core on the cluster at a glance to determine how busy&#8230;<\/p>\n","protected":false},"author":2,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The HPC Dashboard - UCT HPC<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The HPC Dashboard - UCT HPC\" \/>\n<meta property=\"og:description\" content=\"There are several ways to interrogate the cluster to view its general status. There are command line tools such as qstat, cores and sinfo, however those are all text based. 
The dashboard provides a single visual interface that allows one to scan every single core on the cluster at a glance to determine how busy...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/\" \/>\n<meta property=\"og:site_name\" content=\"UCT HPC\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-08T12:37:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-175x300.jpg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/\",\"name\":\"The HPC Dashboard - UCT HPC\",\"isPartOf\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#website\"},\"datePublished\":\"2022-11-18T15:29:49+00:00\",\"dateModified\":\"2023-06-08T12:37:27+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/#breadcrumb\"},\"inLanguage\":\"en-ZA\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ucthpc.uct.ac.za\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"The HPC Dashboard\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#website\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/\",\"name\":\"UCT HPC\",\"description\":\"University of Cape Town High Performance 
Computing\",\"publisher\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ucthpc.uct.ac.za\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-ZA\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#organization\",\"name\":\"University of Cape Town High Performance Computing\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-ZA\",\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png\",\"contentUrl\":\"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png\",\"width\":450,\"height\":423,\"caption\":\"University of Cape Town High Performance Computing\"},\"image\":{\"@id\":\"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"The HPC Dashboard - UCT HPC","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/","og_locale":"en_US","og_type":"article","og_title":"The HPC Dashboard - UCT HPC","og_description":"There are several ways to interrogate the cluster to view its general status. There are command line tools such as qstat, cores and sinfo, however those are all text based. 
The dashboard provides a single visual interface that allows one to scan every single core on the cluster at a glance to determine how busy...","og_url":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/","og_site_name":"UCT HPC","article_modified_time":"2023-06-08T12:37:27+00:00","og_image":[{"url":"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2022\/11\/Overview-175x300.jpg"}],"twitter_card":"summary_large_image","twitter_misc":{"Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/","url":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/","name":"The HPC Dashboard - UCT HPC","isPartOf":{"@id":"https:\/\/ucthpc.uct.ac.za\/#website"},"datePublished":"2022-11-18T15:29:49+00:00","dateModified":"2023-06-08T12:37:27+00:00","breadcrumb":{"@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/#breadcrumb"},"inLanguage":"en-ZA","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/ucthpc.uct.ac.za\/index.php\/the-hpc-dashboard\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ucthpc.uct.ac.za\/"},{"@type":"ListItem","position":2,"name":"The HPC Dashboard"}]},{"@type":"WebSite","@id":"https:\/\/ucthpc.uct.ac.za\/#website","url":"https:\/\/ucthpc.uct.ac.za\/","name":"UCT HPC","description":"University of Cape Town High Performance Computing","publisher":{"@id":"https:\/\/ucthpc.uct.ac.za\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ucthpc.uct.ac.za\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-ZA"},{"@type":"Organization","@id":"https:\/\/ucthpc.uct.ac.za\/#organization","name":"University of Cape Town High Performance 
Computing","url":"https:\/\/ucthpc.uct.ac.za\/","logo":{"@type":"ImageObject","inLanguage":"en-ZA","@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/","url":"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png","contentUrl":"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/09\/logocircless.png","width":450,"height":423,"caption":"University of Cape Town High Performance Computing"},"image":{"@id":"https:\/\/ucthpc.uct.ac.za\/#\/schema\/logo\/image\/"}}]}},"_links":{"self":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/pages\/4453"}],"collection":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/comments?post=4453"}],"version-history":[{"count":24,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/pages\/4453\/revisions"}],"predecessor-version":[{"id":4531,"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/pages\/4453\/revisions\/4531"}],"wp:attachment":[{"href":"https:\/\/ucthpc.uct.ac.za\/index.php\/wp-json\/wp\/v2\/media?parent=4453"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}