Reputation: 191
I am using a Torque+MAUI cluster.
Current utilization is about 10 of the 40 available nodes, yet many jobs sit in the queue and cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but qstat -n shows this abnormal output:
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the session ID is missing, and evidently the script does not run at all, i.e. there is no trace of the java program ever being started.
After about 5 minutes in this strange running state, the job is set back to Q(ueue) status and seemingly never runs again (I monitored it for about a week, and it did not run even after reaching the top of the queue).
I submitted the same job 14 times and monitored the allocated nodes with qstat -n. Seven copies ran successfully on various nodes, but every job allocated to z0-1/* got stuck with this strange startup behavior.
Does anyone know a solution to this issue?
As a temporary workaround, how can I tell the PBS script NOT to use those strange nodes?
Upvotes: 1
Views: 546
Reputation: 191
For users: contact your administrator and, in the meantime, run the job using this workaround:
Use pbsnodes to check for free and healthy nodes.
Modify the PBS directive to name those nodes explicitly: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+...
Submit the job using qsub.
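As a rough sketch of the second step, a small helper can assemble the nodes= string from a list of host names you picked by hand. The function name build_nodes_spec and the host name z0-2 are made up for illustration; substitute whichever nodes pbsnodes reports as free and healthy on your cluster.

```shell
#!/bin/bash
# Build a Torque "nodes=" resource string from explicit host names.
# build_nodes_spec is a hypothetical helper, not part of Torque.
# Usage: build_nodes_spec <ppn> <node> [<node> ...]
build_nodes_spec() {
    local ppn="$1"
    shift
    local spec="" node
    for node in "$@"; do
        # Join entries with "+", e.g. nodeA:ppn=32+nodeB:ppn=32
        spec="${spec:+${spec}+}${node}:ppn=${ppn}"
    done
    printf 'nodes=%s\n' "$spec"
}

# z0-2 is a placeholder host name, assumed to be a healthy node.
build_nodes_spec 32 z0-2
```

The result can go into the #PBS -l line of the script, or be passed at submission time, for example qsub -l "$(build_nodes_spec 32 z0-2)" run.sh.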
Upvotes: 0
Reputation: 7203
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working with pbsnodes -o <node name> and allow the rest of the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that, I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) and figure out what is wrong with them. Once you have corrected the problem, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
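Put together, the administrator-side sequence might look like the sketch below. It is a dry run by default and only prints the commands; the wrapper names run, quarantine_node and restore_node are invented for this example, z0-1 is the node from the question, and releasehold ALL is taken from the suggestion above without further verification.

```shell
#!/bin/bash
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 on the Torque/Maui server to really run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Take a misbehaving node out of service and release held jobs.
quarantine_node() {
    run pbsnodes -o "$1"   # mark the node offline
    run releasehold ALL    # as suggested above; confirm the syntax for your Maui version
}

# Put the node back online once the underlying problem is fixed.
restore_node() {
    run pbsnodes -c "$1"   # clear the offline flag
}

quarantine_node z0-1
```

Keeping the dry run as the default makes it easy to review exactly what would be executed before touching a production scheduler.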
Upvotes: 1