Reputation: 1527
How do I get a list of (recently) failed jobs (failed=100 or exit_status=137) from the SGE? From the qacct help:
[-j [job_id|job_name|pattern]] list all [matching] jobs
How do I use the pattern? I tried the following, but it does not work:
qacct -j failed=100
Upvotes: 2
Views: 4301
Reputation: 11
I wrote a Python script that parses the accounting file for failed jobs. You will need to adapt it to your own setup, for example the path to the accounting file.
#!/usr/local/bin/python2.7
import sys
import getopt
import datetime
#Variables
program = "parse_acct.py"
ifile = "/local/cluster/sge/default/common/accounting"
failed = 0
failedswitch = 0
subtime = 0
subtimeswitch = 0
begtime = 0
begtimeswitch = 0
endtime = 0
endtimeswitch = 0
user = 0
userswitch = 0
node = ""
nodeswitch = 0
### Read command line args
try:
    myopts, args = getopt.getopt(sys.argv[1:], "i:f:n:t:u:b:e:h")
except getopt.GetoptError:
    print(program + " -i <input> -u <username> -n <node_name> -f <failed>")
    sys.exit(2)
###############################
# o == option
# a == argument passed to the o
###############################
for o, a in myopts:
    if o == '-f':
        failed = a
        failedswitch = 1
    elif o == '-i':
        ifile = a
    elif o == '-u':
        user = a
        userswitch = 1
    elif o == '-t':
        subtime = a
        subtimeswitch = 1
    elif o == '-b':
        begtime = a
        begtimeswitch = 1
    elif o == '-e':
        endtime = a
        endtimeswitch = 1
    elif o == '-n':
        node = a
        nodeswitch = 1
    elif o == '-h':
        print(program + " -i <input> -u <username> -n <node_name> -f <failed>")
        sys.exit(0)
    else:
        print("Usage: %s -i <input> -u <username> -n <node_name> -f <failed>" % sys.argv[0])
        sys.exit(0)
### --- Read line by line and import into a list of lists --- ###
loi = []
with open(ifile, "r") as f:
    for var in f:
        line = var.rstrip().split(":")
        # The print loop reads fields 0..44, so skip comment lines and short records
        if len(line) >= 45:
            loi.append(line)
### --- Parse through the list of lists and put a 0 at the beginning if it fails a test --- ###
for i in range(len(loi)):
    # Fields are strings, so compare against the string passed to -f
    if failedswitch == 1 and loi[i][11] != failed:
        loi[i][0] = [0]
    elif userswitch == 1 and loi[i][3] != user:
        loi[i][0] = [0]
    elif nodeswitch == 1 and node != loi[i][1]:
        loi[i][0] = [0]
    # Alternative: substring match on the node name instead of an exact match
    # elif nodeswitch == 1 and node not in loi[i][1]:
    #     loi[i][0] = [0]
### --- Remove all entries that have the "0" at the beginning --- ###
loidedup = [x for x in loi if x[0] != [0]]
### --- Print out the records that passed all tests --- ###
for rec in loidedup:
    print("==============================================================")
    print("qname " + rec[0])
    print("hostname " + rec[1])
    print("group " + rec[2])
    print("owner " + rec[3])
    print("job_name " + rec[4])
    print("job_number " + rec[5])
    print("account " + rec[6])
    print("priority " + rec[7])
    print("submission_time " + datetime.datetime.fromtimestamp(int(rec[8])).strftime('%Y-%m-%d %H:%M:%S'))
    print("start_time " + datetime.datetime.fromtimestamp(int(rec[9])).strftime('%Y-%m-%d %H:%M:%S'))
    print("end_time " + datetime.datetime.fromtimestamp(int(rec[10])).strftime('%Y-%m-%d %H:%M:%S'))
    print("failed " + rec[11])
    print("exit_status " + rec[12])
    print("ru_wallclock " + rec[13])
    print(" ru_utime " + rec[14])
    print(" ru_stime " + rec[15])
    print(" ru_maxrss " + rec[16])
    print(" ru_ixrss " + rec[17])
    print(" ru_ismrss " + rec[18])
    print(" ru_idrss " + rec[19])
    print(" ru_isrss " + rec[20])
    print(" ru_minflt " + rec[21])
    print(" ru_majflt " + rec[22])
    print(" ru_nswap " + rec[23])
    print(" ru_inblock " + rec[24])
    print(" ru_oublock " + rec[25])
    print(" ru_msgsnd " + rec[26])
    print(" ru_msgrcv " + rec[27])
    print(" ru_nsignals " + rec[28])
    print(" ru_nvcsw " + rec[29])
    print(" ru_nivcsw " + rec[30])
    print("project " + rec[31])
    print("department " + rec[32])
    print("granted_pe " + rec[33])
    print("slots " + rec[34])
    print("task_number " + rec[35])
    print("cpu " + rec[36])
    print("mem " + rec[37])
    print("io " + rec[38])
    print("category " + rec[39])
    print("iow " + rec[40])
    print("pe_taskid " + rec[41])
    print("maxvmem " + rec[42])
    print("arid " + rec[43])
    print("ar_submission_time " + rec[44])
Upvotes: 1
Reputation: 401
"pattern" in this case refers to a simple globbing expression to match against a job name, e.g. qacct -j 'myjob*'
qacct unfortunately doesn't have the filtering capability you're looking for: it can filter on complex job attributes, but not on fundamental ones like exit_status or failed.
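To illustrate what a globbing pattern on the job name selects, here is a minimal Python sketch. It assumes the matching behaves like ordinary shell-style wildcards, which Python's fnmatch module emulates; it does not call qacct, and the job names in the list are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical job names, for illustration only
job_names = ["myjobstep16", "myjob_final", "otherjob"]

# Shell-style glob: '*' matches any run of characters
matching = [name for name in job_names if fnmatchcase(name, "myjob*")]
print(matching)
```

Only names that begin with myjob are selected, so a string such as failed=100 is treated as a job name pattern rather than as a filter on the failed field.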
You CAN retrieve that information from the SGE accounting file (assuming you have access to it) with just a little work. When SGE finishes a job, it writes a simple record to $SGE_ROOT/$SGE_CELL/common/accounting, which is the file that qacct reads. Check the accounting(5) man page on your qmaster for details specific to your Grid Engine version, but a job record in your accounting file should look more or less like this:
all.q:myexechost:group:user:myjobstep16:1126971:sge:0:1369755166:1369768897:1369769771:0:0:874:796.564903:30.676336:15788.000000:0:0:0:0:17009:2:0:47987400.000000:34033048:0:0:0:9468:27604:NONE:defaultdepartment:NONE:1:0:827.241239:96.445328:39.111400:-q all.q:0.000000:NONE:237133824.000000:0:0
In this particular record, failed and exit_status are the 12th and 13th fields, respectively. For a quick and dirty "recent failures" list, we can combine these with fields 6 (job ID) and 11 (job end time). The following pipeline takes the 100 most recently finished jobs and shows those with failed=100 and exit_status=137:
$ cut -d':' -f6,11,12,13 $SGE_ROOT/$SGE_CELL/common/accounting|sort -t':' -k2|tail -100|grep ':100:137'
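The same idea can be sketched in Python if you want "failed=100 or exit_status=137" rather than both at once. This is a rough sketch, not a replacement for qacct: it assumes the field positions described above (job ID 6th, end time 11th, failed 12th, exit_status 13th), and the names JobRecord, parse_record, and recent_failures, as well as the shortened sample records, are made up for the example.

```python
from collections import namedtuple

# Only the fields needed here (hypothetical helper type)
JobRecord = namedtuple("JobRecord", "job_number end_time failed exit_status")


def parse_record(line):
    """Return a JobRecord, or None for comment lines and malformed records."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split(":")
    if len(fields) < 13:
        return None
    try:
        # 0-based indices: 5 = job ID, 10 = end time, 11 = failed, 12 = exit_status
        return JobRecord(int(fields[5]), int(fields[10]),
                         int(fields[11]), int(fields[12]))
    except ValueError:
        return None


def recent_failures(lines, last_n=100, failed=100, exit_status=137):
    """Among the last_n most recently finished jobs, keep those matching either code."""
    records = [r for r in map(parse_record, lines) if r is not None]
    records.sort(key=lambda r: r.end_time)
    return [r for r in records[-last_n:]
            if r.failed == failed or r.exit_status == exit_status]


# Made-up sample records, truncated after the exit_status field
sample = [
    "all.q:myexechost:group:user:myjobstep16:1126971:sge:0:1369755166:1369768897:1369769771:0:0",
    "all.q:myexechost:group:user:myjobstep17:1126972:sge:0:1369755166:1369768897:1369769900:100:137",
    "all.q:myexechost:group:user:myjobstep18:1126973:sge:0:1369755166:1369768897:1369769800:0:137",
]
print([r.job_number for r in recent_failures(sample)])
```

To run it against a real cluster, pass an open file handle for $SGE_ROOT/$SGE_CELL/common/accounting instead of the sample list.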
Upvotes: 5