We encountered an issue where a user's jobs used far more CPUs than they had been allocated. This occurred even though ppn was set to 1: the server load went above 400 and individual processes were using up to 4000% CPU.
Clearly a sub-optimal state.
We tracked this down to a Java tool, which was odd as Java should run serially by default. Even stranger was the tool itself: SortSam, a small, non-parallel program for sorting BAM files for GATK read-pair walkers. Googling the tool name provided the solution.
The problem was garbage collection. Java detected that it was running on a 64-bit OS, so the JVM ran in server mode, where the parallel garbage collector sizes its thread pool from the number of cores on the machine rather than the number allocated to the job. In other words, all 50 jobs ran GC across all 64 cores, and that doesn't even include all the MATLAB, GROMACS and NAMD jobs that other users were trying to run. We're pretty sure that we could have timed this condition by looking at the data centre temperature graphs.
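To put rough numbers on this, here is a small sketch of the arithmetic. It assumes the default heuristic HotSpot has commonly used for the parallel collector: one GC thread per core up to 8 cores, then about 5/8 of a thread for each additional core. The exact formula may differ between JVM versions and platforms, and the class and method names are made up for illustration.

```java
// Rough estimate of the default parallel GC thread count on HotSpot.
// Assumption: n threads for n <= 8 cores, otherwise 8 + (n - 8) * 5/8.
// The class and method names are illustrative, not part of any tool.
class GcThreadEstimate {
    static int defaultParallelGcThreads(int cores) {
        return cores <= 8 ? cores : 8 + (cores - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        int cores = 64; // cores on the node
        int jobs = 50;  // jobs running at the same time
        int perJvm = defaultParallelGcThreads(cores);
        System.out.println("GC threads per JVM: " + perJvm);
        System.out.println("Worst-case GC threads on the node: " + perJvm * jobs);
    }
}
```

Under those assumptions each JVM would start about 43 GC threads, which is in the same ballpark as the 4000% CPU we saw per process.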
Setting -XX:ParallelGCThreads=1 solved this issue.
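If you want to check what a JVM on your own nodes actually sees, something like the following may help. It is only a minimal sketch: it prints the core count the JVM reports, the collectors it has selected, and whether a ParallelGCThreads flag was passed at startup. The class and helper names are ours, invented for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative diagnostic; GcSettingsCheck and findGcThreadsFlag are made-up names.
class GcSettingsCheck {
    // Returns the -XX:ParallelGCThreads=... argument if one was passed, else null.
    static String findGcThreadsFlag(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-XX:ParallelGCThreads=")) {
                return arg;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Core count as seen by the JVM, which may be the whole node rather than the job's allocation.
        System.out.println("Cores visible to the JVM: " + Runtime.getRuntime().availableProcessors());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("Collector in use: " + gc.getName());
        }
        String flag = findGcThreadsFlag(ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.println(flag != null ? "Found " + flag : "ParallelGCThreads not set; JVM default applies");
    }
}
```

Run it with and without the flag, e.g. java -XX:ParallelGCThreads=1 GcSettingsCheck, to compare. For a jar-based tool the option goes before -jar on the java command line.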