{"version":"1.0","provider_name":"UCT HPC","provider_url":"https:\/\/ucthpc.uct.ac.za","author_name":"Andrew Lewis","author_url":"https:\/\/ucthpc.uct.ac.za\/index.php\/author\/andrew-lewis\/","title":"Parallel code, benefits and pit-falls - UCT HPC","type":"rich","width":600,"height":338,"html":"<blockquote class=\"wp-embedded-content\" data-secret=\"0mgwwdc9A3\"><a href=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2010\/12\/15\/parallel-code-benefits-and-pit-falls\/\">Parallel code, benefits and pit-falls<\/a><\/blockquote><iframe sandbox=\"allow-scripts\" security=\"restricted\" src=\"https:\/\/ucthpc.uct.ac.za\/index.php\/2010\/12\/15\/parallel-code-benefits-and-pit-falls\/embed\/#?secret=0mgwwdc9A3\" width=\"600\" height=\"338\" title=\"&#8220;Parallel code, benefits and pit-falls&#8221; &#8212; UCT HPC\" data-secret=\"0mgwwdc9A3\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" class=\"wp-embedded-content\"><\/iframe><script type=\"text\/javascript\">\n\/*! This file is auto-generated *\/\n!function(c,d){\"use strict\";var e=!1,o=!1;if(d.querySelector)if(c.addEventListener)e=!0;if(c.wp=c.wp||{},c.wp.receiveEmbedMessage);else if(c.wp.receiveEmbedMessage=function(e){var t=e.data;if(!t);else if(!(t.secret||t.message||t.value));else if(\/[^a-zA-Z0-9]\/.test(t.secret));else{for(var r,s,a,i=d.querySelectorAll('iframe[data-secret=\"'+t.secret+'\"]'),n=d.querySelectorAll('blockquote[data-secret=\"'+t.secret+'\"]'),o=new RegExp(\"^https?:$\",\"i\"),l=0;l<n.length;l++)n[l].style.display=\"none\";for(l=0;l<i.length;l++)if(r=i[l],e.source!==r.contentWindow);else{if(r.removeAttribute(\"style\"),\"height\"===t.message){if(1e3<(s=parseInt(t.value,10)))s=1e3;else if(~~s<200)s=200;r.height=s}if(\"link\"===t.message)if(s=d.createElement(\"a\"),a=d.createElement(\"a\"),s.href=r.getAttribute(\"src\"),a.href=t.value,!o.test(a.protocol));else 
if(a.host===s.host)if(d.activeElement===r)c.top.location.href=t.value}}},e)c.addEventListener(\"message\",c.wp.receiveEmbedMessage,!1),d.addEventListener(\"DOMContentLoaded\",t,!1),c.addEventListener(\"load\",t,!1);function t(){if(o);else{o=!0;for(var e,t,r,s=-1!==navigator.appVersion.indexOf(\"MSIE 10\"),a=!!navigator.userAgent.match(\/Trident.*rv:11\\.\/),i=d.querySelectorAll(\"iframe.wp-embedded-content\"),n=0;n<i.length;n++){if(!(r=(t=i[n]).getAttribute(\"data-secret\")))r=Math.random().toString(36).substr(2,10),t.src+=\"#?secret=\"+r,t.setAttribute(\"data-secret\",r);if(s||a)(e=t.cloneNode(!0)).removeAttribute(\"security\"),t.parentNode.replaceChild(e,t);t.contentWindow.postMessage({message:\"ready\",secret:r},\"*\")}}}}(window,document);\n<\/script>\n","description":"Most high-end platforms for high performance computing are equipped with multi-core CPUs. To fully utilize the CPUs, either multiple jobs must be run on each platform or the code must be changed to use multiple CPUs. There are several ways to take advantage of multiple CPUs: OpenMP, MPI (for example the MPICH implementation), etc. The simpler approaches use one server and all its CPUs in a shared memory model. The more complex approach is to split the code across several servers, with a master process handling communication between the servers and aggregating the results. Either way, well-written code split across multiple CPUs can generally complete a job faster. There are several caveats: some code cannot be 'parallelized' due to the nature of the algorithm, the code should be correctly optimized, disk I\/O should be reduced, and when the code is split across several servers network latency can become a significant delaying factor. Below is a graph of job completion times, where a lower (faster) result is better. The first bar is the time for the job to complete using only one processor. This is a simple array calculation written in C++ running on a BL460 blade with dual quad 
cores. The single-CPU iterative job completes in 20 seconds. Next, the code is compiled with OpenMP (the omp.h header), allowing it to parallelize the array calculation loops. Unexpectedly, the time to complete is longer than for the iterative job. This is because the job was only allowed to run on one core: the overhead of the OpenMP runtime managing multi-threading on that core is what caused the increase in run-time. By increasing the number of cores on which the job is allowed to run we see an immediate reduction in job time. This is unfortunately not a linear improvement, due to communication latency, in this case in the processor cache. OpenMP allows more threads to run than there are physical cores, which is fine for the purpose of testing. Additionally, one can run more than one multi-threaded job per server. These practices should however be avoided, as they cause processor contention when the tasks are switched in and out of CPU context. This behaviour is clearly seen in the last two job runs.","thumbnail_url":"https:\/\/ucthpc.uct.ac.za\/wp-content\/uploads\/2015\/07\/Times.png"}