One of the frustrations users experience on a cluster made up of a large number of homogeneous servers is that they often share this space with a diverse population of researchers and software. While it is possible to divide resources into groups to segregate usage patterns, this always leads to some wastage of resources. Conversely, allowing users to submit jobs to any server can lead to the following situation with parallel jobs: a user needs 90 cores and therefore needs to split their job over 3 servers. Logically this would mean 30 cores per server, using the following directive:
-l nodes=3:ppn=30
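For context, a directive like this normally sits at the top of a submission script passed to qsub. A minimal sketch follows; the job name, the program and the use of the UCTlong queue (taken from the qmgr example further down) are placeholders, not part of the original example:

#!/bin/bash
# Minimal Torque submission script (job name, queue and program are placeholders)
#PBS -N parallel_example
#PBS -q UCTlong
#PBS -l nodes=3:ppn=30
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# 3 servers x 30 cores each = 90 MPI ranks
mpirun -np 90 ./my_program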
This works fine as long as there are 30 cores free on each server. However, consider a situation where servers 1 and 2 each have 40 cores free but server 3 has only 15: the job will be held until server 3 has 30 cores free, and by then servers 1 and 2 may be running other jobs, so the user may wait a long time for their requirements to be met and the job to run. One way around this is to tailor a directive for this specific situation:
-l nodes=server1:ppn=40+server2:ppn=40+server3:ppn=10
This will work, but it is tedious to set up, inelegant, and impractical if the user wants to run a large number of jobs. Torque also supports a directive in which the nodes request is treated as a count of cores rather than servers:
-l nodes=90
which actually provides 90 cores, not 90 nodes. This works fine as long as the number of nodes (cores) requested does not exceed the actual number of nodes (servers) in the cluster. In the case above, as soon as the user requests -l nodes=5 the request is denied, since Torque considers this to exceed the available resources. There is, however, a workaround: set the following parameters in QMGR:
set server resources_available.nodect=N
set queue UCTlong resources_available.nodect=N
where N is the total number of cores in the cluster. Now the user can request -l nodes=90 and Torque provides 90 cores according to the queuing strategy. Be aware that the distribution of the job over servers is automatic, hence per-server memory usage and overall network latency can be unpredictable.
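As a sketch of how this might be applied (the queue name comes from the settings above; the pbsnodes output format and the awk pattern are assumptions that may need adjusting for a particular Torque version), N can be derived from pbsnodes and the settings applied non-interactively with qmgr -c:

# Sum the np (cores per node) values advertised by pbsnodes (assumes the usual "np = X" output lines)
N=$(pbsnodes -a | awk '/^ *np = / {sum += $3} END {print sum}')
# Apply the workaround to both the server and the UCTlong queue
qmgr -c "set server resources_available.nodect=$N"
qmgr -c "set queue UCTlong resources_available.nodect=$N"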
Many thanks to Graham Inggs from UCT's Chemical Engineering department for this workaround.