This evening we observed some odd behaviour on our Grid cluster. Users could submit jobs, but the jobs were queued immediately instead of running. Initially we suspected that only one worker node was affected, but we soon realized that all three worker nodes were exhibiting the same issue. Oddly, some jobs (short-term test jobs submitted via EUMed) were still running.
Restarting the pbs_server daemon on the head node had no effect, other than causing all worker nodes to register a down status. Checking the worker nodes revealed that every pbs_mom daemon was running but unresponsive. Restarting all pbs_mom daemons allowed some jobs to be submitted, but only on two of the worker nodes. We then noticed an old SAGrid job that had been sitting in a queued state for several days. Killing this job returned the queues to a healthy state.
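A stale queued job like the one above could be spotted automatically. The following is only a rough sketch, assuming the text comes from `qstat -f` and that each job record carries `job_state` and `qtime` fields in the usual `key = value` layout. The function names, the two-day threshold, and the sample records (including the host name) are all made up for illustration and are not taken from our cluster.

```python
from datetime import datetime, timedelta


def parse_qstat_full(text):
    """Parse `qstat -f`-style text into a list of dicts, one per job.

    Assumes each record starts with a "Job Id:" line followed by
    indented "key = value" lines (wrapped continuation lines are ignored).
    """
    jobs = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Job Id:"):
            current = {"id": stripped.split(":", 1)[1].strip()}
            jobs.append(current)
        elif current is not None and " = " in stripped:
            key, value = stripped.split(" = ", 1)
            current[key.strip()] = value.strip()
    return jobs


def stale_queued_jobs(jobs, now, max_age):
    """Return ids of jobs in state Q whose qtime is older than max_age."""
    stale = []
    for job in jobs:
        if job.get("job_state") != "Q" or "qtime" not in job:
            continue
        # Assumed timestamp layout, e.g. "Mon Mar  1 09:00:00 2010".
        queued_at = datetime.strptime(job["qtime"], "%a %b %d %H:%M:%S %Y")
        if now - queued_at > max_age:
            stale.append(job["id"])
    return stale


# Hypothetical sample input, not real output from the incident.
SAMPLE = """\
Job Id: 101.head.example.org
    Job_Name = old_job
    job_state = Q
    qtime = Mon Mar  1 09:00:00 2010
Job Id: 102.head.example.org
    Job_Name = fresh_job
    job_state = Q
    qtime = Fri Mar  5 17:30:00 2010
Job Id: 103.head.example.org
    Job_Name = running_job
    job_state = R
    qtime = Mon Mar  1 08:00:00 2010
"""

if __name__ == "__main__":
    now = datetime(2010, 3, 5, 18, 0, 0)
    print(stale_queued_jobs(parse_qstat_full(SAMPLE), now, timedelta(days=2)))
```

In practice the text would come from running `qstat -f` on the head node, and the threshold would need tuning to the longest legitimate wait time on the queues.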
We are not sure exactly what the issue was; possibly a malformed queue submission or JDL caused a hang in the scheduler. We are now considering more monitoring that checks the state of the PBS daemons.
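As a starting point for that kind of monitoring, here is a minimal sketch that flags worker nodes the server reports as unhealthy. It assumes the input is `pbsnodes -a` style text, where each node name sits on an unindented line followed by indented `key = value` lines, and where `state` may hold a comma-separated list. The function names, the set of "bad" states, and the sample node names are assumptions for illustration only.

```python
def parse_pbsnodes(text):
    """Return {node_name: [state, ...]} from `pbsnodes -a`-style text."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node record.
            current = None
        elif not line[0].isspace():
            # An unindented line is taken to be a node name.
            current = line.strip()
            nodes[current] = []
        elif current is not None and line.strip().startswith("state = "):
            nodes[current] = line.strip()[len("state = "):].split(",")
    return nodes


def unhealthy_nodes(nodes, bad_states=("down", "offline")):
    """Return sorted names of nodes with at least one state in bad_states."""
    return sorted(
        name
        for name, states in nodes.items()
        if any(state in bad_states for state in states)
    )


# Hypothetical sample input, not real output from our cluster.
SAMPLE_NODES = """\
wn01.example.org
     state = free
     np = 4

wn02.example.org
     state = down,offline
     np = 4

wn03.example.org
     state = job-exclusive
     np = 4
"""

if __name__ == "__main__":
    print(unhealthy_nodes(parse_pbsnodes(SAMPLE_NODES)))
```

A check like this only sees what pbs_server reports, so it would not catch a pbs_mom that is running but unresponsive while the server still lists the node as free. A fuller check would probably also need to probe each pbs_mom directly.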