We've spent most of this week working on mpiBLAST and have bumped our heads against a few problems. Fortunately we have a very patient user in computational biology who's assisted in running test jobs. It's been a learning experience, so we thought we'd jot down a few notes...
Running OpenMPI jobs in Torque/PBS is not quite the same as running them directly from a head node. Firstly, the worker node that the job is launched from is considered a 'head node' from OpenMPI's perspective. This means that when setting up SSH key sharing in the cluster, a many-to-many trust relationship is required between the worker nodes: any node must be able to open a session on any other.
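A minimal sketch of one way to achieve that, assuming home directories are not shared and using placeholder hostnames wn01-wn04 (neither assumption comes from our actual cluster):

    # one passwordless key pair used by every worker node
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys

    # push the same key material to every worker node so that
    # any node can SSH to any other (many-to-many)
    for host in wn01 wn02 wn03 wn04; do
        scp -pr ~/.ssh ${host}:~/
    done

With a shared home directory the scp loop is unnecessary; appending the public key to authorized_keys once is enough.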
Additionally, the way PBS and mpirun are invoked is slightly different. With OpenMPI jobs it's best to specify only the number of cores the job needs; to do that, bear in mind that the nodes argument to PBS's -l parameter is treated as CPUs, not servers.
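For example (the script name is a placeholder of ours), requesting 20 "nodes" at submission time actually reserves 20 CPU slots, which the scheduler is free to pack onto fewer physical servers:

    # reserves 20 CPU slots, NOT 20 physical servers;
    # Torque/Maui may spread them across any mix of machines
    qsub -l nodes=20 myjob.pbs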
There are two other crucial elements to bear in mind. Firstly, the machine or host file should be referenced from PBS rather than created by the user; this is done by using the $PBS_NODEFILE variable. Secondly, PBS should be allowed to supply the core count, rather than having it requested via mpirun's -np argument. The number of nodes versus threads that users can consume can be controlled via the maui.cfg file.
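Pulling those pieces together, here's a minimal sketch of a submission script. The job name, walltime and mpiblast arguments are illustrative assumptions, not our production settings:

    #!/bin/bash
    #PBS -N mpiblast-test
    #PBS -l nodes=20
    #PBS -l walltime=02:00:00

    cd $PBS_O_WORKDIR

    # no -np argument: PBS supplies the core count, and the
    # host list comes from the scheduler via $PBS_NODEFILE
    mpirun -machinefile $PBS_NODEFILE mpiblast -p blastn -d nt -i query.fasta -o results.out

As for the per-user limits, Maui supports per-credential settings in maui.cfg along the lines of USERCFG[DEFAULT] MAXPROC=20, though you should check the Maui docs for the exact policy names in your version.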
Below is a screenshot of multiple MPI jobs seeking 5 CPUs each on worker nodes of any series. The starting node for the initial job was unspecified and turned out to be node 300. Nodes 300, 206, 205 and 204 show high CPU but no threads advertised, as they're just winding down from 3 completed jobs totaling 15 cores. The single thread on 204 is the first of 5 that spread "left" into 203.
Another item to consider is heterogeneous environments. Not all clusters are composed of identical equipment, so allowing auto-assignment of resources in MPI jobs can produce unpredictable results. In the image above, the 300-series CPUs are taking longer to spool down than the 200 series. To constrain where jobs run, use can be made of the free-form node_spec property in the Torque nodes file. Here, though, remember that once again nodes = servers, so you'll also need the ppn directive.
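For illustration, the free-form properties are simply appended to each entry in Torque's server_priv/nodes file. The hostnames and np counts below are placeholders; the series400 label matches the directive we use for the BL460s below:

    # /var/spool/torque/server_priv/nodes
    node203 np=4 series200
    node204 np=4 series200
    node461 np=8 series400
    node462 np=8 series400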
So to reserve 20 cores on the BL460 servers (4 servers at 5 cores apiece) use the directive: #PBS -l nodes=4:series400:ppn=5
If there are any inaccuracies in the above please feel free to point them out. Here are two articles we found useful for running OpenMPI under Torque.