Quite simply, our cluster is running at almost maximum capacity, and there's not much we can do about that. It's always nice to be wanted though :-)
Our strategy to deal with this is threefold:
1) Shuffle user priorities to allow low-core, short-term users to jump the queue. This may seem unfair, but consider a user requiring 32 cores holding everyone back for 2 days, even though some users only need 1 core for an hour and there are currently 31 cores free. Unfortunately the built-in fair-share policy can't deal with this situation fast enough, so this is being done manually on a case-by-case basis just to keep as many jobs flowing as possible (a sketch of the kind of manual bump involved follows this list). We have set the Galaxy user to a very high priority, as these are pipeline jobs that need to be dealt with sequentially; they are also fairly short term and run in batches.
2) Move users to other queues. We may ask some of our users to change their scripts to run on CLOUDQ or CLOUDHMQ. Most likely these will be users who run either short-term jobs (under 1 hour) or high-memory jobs; an example of the one-line script change follows the list, after the priority sketch.
3) Buy more kit. This is being done, but it's expensive and needs to be well motivated. We hope to have another 512 cores put in over the next few months.
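To make point 1 concrete, here is a minimal sketch of the kind of manual priority bump involved, assuming a PBS/Torque-style scheduler; the post does not name the batch system, and the job ID, priority values, and script name below are made up for illustration.

    # List the queue to find the short 1-core job stuck behind the
    # 32-core request (job ID below is hypothetical).
    qstat -a

    # Manually raise that job's priority so it starts in the free cores.
    # In PBS, -p accepts values from -1024 to +1023.
    qalter -p 1023 12345

    # Galaxy pipeline jobs can be given a standing high priority at
    # submission time, so each batch step is dispatched promptly.
    qsub -p 1000 galaxy_step.sh

Note that a priority change only matters while a job is still queued; once a job is running, the scheduler has already made its dispatch decision.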
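And for point 2, a hypothetical job script showing what the change would look like, again assuming PBS-style directives; the resource requests and program name are placeholders, and only the queue line differs from a user's existing script (use CLOUDHMQ instead of CLOUDQ for high-memory jobs).

    #!/bin/bash
    #PBS -q CLOUDQ                     # the requested queue change
    #PBS -l walltime=00:45:00          # sub-1-hour jobs suit CLOUDQ
    #PBS -l select=1:ncpus=1:mem=2gb   # placeholder resource request

    cd "$PBS_O_WORKDIR"
    ./my_short_analysis                # placeholder for the user's program

Moving these jobs off the main queue frees its slots for the large, long-running parallel work that genuinely needs them.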