Reputation: 87
I don't have a lot of information, so please let me know what I can do to diagnose this.
My HPC cluster has a few compute nodes, and one of the jobs I submitted last night stalled after a few hours of runtime. I checked with qstat this morning and found that it had made no progress since I last checked it yesterday. The other nodes seem to be processing jobs fine.
I deleted the job and resubmitted it, but now it just sits in the queue, even though there are no other jobs scheduled ahead of it.
gstat shows that it has no processes lined up, but that the node is active.
qstat -s says "Not running: Draining system to allow starving job to run"
If it's helpful, this is set up in a CentOS 6.5 environment.
What else can I do to diagnose this issue?
Upvotes: 0
Views: 903
Reputation: 87
It turns out that Torque scripts running for more than 24 hours cause a pause to be placed on all other jobs submitted to the scheduler. We needed to kill the responsible job, and then everything fell back into place.
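For anyone hitting the same thing, here is a rough sketch of how one might spot the offending job instead of hunting for it by hand. It assumes the output of `qstat -f` uses the usual `Job Id:` header followed by indented `key = value` lines, including `job_state` and `resources_used.walltime`. The 24-hour threshold is simply the figure from our case, and the sample job IDs and the `headnode` name are made up for illustration. It ignores wrapped continuation lines, so treat it as a starting point rather than a finished tool.

```python
def parse_walltime(text):
    """Convert an 'HH:MM:SS' walltime string into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_jobs(qstat_full_output):
    """Group the 'key = value' lines of `qstat -f` output by job ID."""
    jobs = {}
    current = None
    for raw in qstat_full_output.splitlines():
        line = raw.strip()
        if line.startswith("Job Id:"):
            current = line.split(":", 1)[1].strip()
            jobs[current] = {}
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            jobs[current][key.strip()] = value.strip()
    return jobs


def long_running_jobs(qstat_full_output, limit_hours=24):
    """Return IDs of running jobs whose used walltime exceeds the limit."""
    offenders = []
    for job_id, attrs in parse_jobs(qstat_full_output).items():
        walltime = attrs.get("resources_used.walltime")
        if attrs.get("job_state") == "R" and walltime is not None:
            if parse_walltime(walltime) > limit_hours * 3600:
                offenders.append(job_id)
    return offenders


# Made-up sample in the shape of `qstat -f` output (hypothetical job IDs).
SAMPLE = """\
Job Id: 101.headnode
    Job_Name = long_sim
    resources_used.walltime = 31:02:10
    job_state = R
Job Id: 102.headnode
    Job_Name = short_sim
    resources_used.walltime = 02:15:00
    job_state = R
Job Id: 103.headnode
    Job_Name = waiting
    job_state = Q
"""

print(long_running_jobs(SAMPLE))
```

On a real cluster you would feed it the text of `qstat -f` (for example, captured with Python's `subprocess` module) and then decide for each reported job whether to remove it with `qdel`.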
Upvotes: 1