Parallel code, benefits and pit-falls
By Andrew Lewis, 15 December 2010
https://ucthpc.uct.ac.za/index.php/2010/12/15/parallel-code-benefits-and-pit-falls/

Most high-end platforms for high performance computing are equipped with multi-core CPUs. To fully utilize the CPUs, either multiple jobs must be run on each platform or the code must be changed to use multiple CPUs. There are several methods for taking advantage of multiple CPUs: OpenMP, MPI (with implementations such as MPICH), and so on. The simpler approach uses one server and all of its CPUs in a shared-memory model; the more complex approach splits the code across several servers, with a master process handling communication between the separate memories and aggregating the results. Either way, well-written code split across multiple CPUs can generally increase job efficiency.

There are several caveats: some code cannot be parallelized due to the nature of the algorithm, the code should be correctly optimized, disk I/O should be kept to a minimum, and in the distributed-memory model network latency can become a significant delaying factor.

Below is a graph of job completion times, where a lower (faster) result is better. The first bar is the time for the job to complete using only one processor. The job is a simple array calculation written in C++, running on a BL460 blade with two quad-core CPUs. The single-CPU iterative job completes in 20 seconds. Next, the code is compiled against the OpenMP library (omp.h), allowing it to parallelize the array-calculation loops. Unexpectedly, the time to complete is longer than for the iterative job. This is because the job was only allowed to run on one core: the overhead of the OpenMP runtime managing multiple threads on a single core is what caused the increase in run-time.

<img src="https://ucthpc.uct.ac.za/wp-content/uploads/2015/07/Times.png" alt="Graph of job completion times" border="0" />

By increasing the number of cores on which the job is allowed to run, we see an immediate increase in speed and reduction in job time. The improvement is unfortunately not linear, due to communication latency, in this case in the processor cache. OpenMP allows more threads to run than there are physical cores, which is fine for testing purposes. Additionally, one can run more than one multi-threaded job per server. These practices should, however, be avoided, as they cause processor contention while tasks are switched in and out of CPU context. This behaviour is clearly seen in the last two job runs.