Reputation: 11
I have the following text file, job.txt. From every line I want to extract a few fields into a list: the job number, e.g. 48638 (without the "cluster" part), the time field, and the state, e.g. Q.
Please guide me. I have tried this to delete the first two lines:
content = [x.strip('\n') for x in content]
stlist = content[2:]
but I am not able to get the output shown below.
Each entry of the list must look like:
48758 45:00:40 R qp32
job.txt is as follows:
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
48638.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q qp32
48738.tyrone-cluster case3sqTS1e-4 mecvamsi 588:30:5 R qp32
48758.tyrone-cluster meshA5 mecmdjim 45:00:40 R qp32
EDIT: The file can also come in other formats, like the one below. The text is shown here without the column spacing; the original file contains runs of spaces, as in the listing above.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - ----
48998.tyrone-cluster gic1_nwgs mbupi 18:45:44 R qp32
48999.tyrone-cluster gic2_nwgs mbupi 0 Q batch
49005.tyrone-cluster ...01R-1849-01_2 mcbkss 00:44:23 R qp32
8687.tyrone-cluster gaussian_top.sh chemraja 0 Q qp32
49047.tyrone-cluster jet_egrid asevelt 312:33:0 R qp128
49052.tyrone-cluster case3sqTS1e-4 mecvamsi 0 Q qp32
49053.tyrone-cluster ...01R-1850-01_1 mcbkss 0 Q batch
49054.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q batch
So the format changes each time. Can anyone help me write a generalized function that handles all these variants of the file?
Upvotes: 0
Views: 87
Reputation: 5875
You can parse the lines with a regular expression. Place the fields you want to display in capture groups by surrounding the relevant parts of the regular expression with parentheses, then pull those groups out with the group() method of the regex match result.
import re

# joblist will store each line of parsed output
joblist = []
# Raw string so the backslash escapes reach the regex engine unchanged
prog = re.compile(r'^(\d+)\..*\s+.*\s+\w+\s+(.*)\s+(\w)\s+(.*)$')
with open('job.txt', 'r') as jobfile:
    for line in jobfile:
        result = prog.match(line)
        # Skip the header line and any other line that doesn't match the regex
        if result is None:
            continue
        joblist.append(' '.join(result.group(1, 2, 3, 4)))

# Display the list
for job in joblist:
    print(job)
The data you provided:
macbook:Downloads joeyoung$ cat job.txt
Job id Name User Time Use S Queue
48638.cluster ...01R-1850-01_2 mcbkss 0 Q qp32
48738.cluster case3sqTS1e-4 mecvamsi 588:30:5 R qp32
48758.cluster meshA5 mecmdjim 45:00:40 R qp32
48638.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q qp32
48708.tyrone-cluster ...onwgs_entries mbupi 0 Q qp32
48736.tyrone-cluster ...01R-1850-01_1 mcbkss 0 Q batch
48737.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q batch
The output of the script on the above data (including the newly requested time field in column 2):
macbook:Downloads joeyoung$ python parsejob.py
48638 0 Q qp32
48738 588:30:5 R qp32
48758 45:00:40 R qp32
48638 0 Q qp32
48708 0 Q qp32
48736 0 Q batch
48737 0 Q batch
The parsed data is available in the joblist list variable.
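Since the question asks for a generalized function, here is one possible sketch that wraps the same idea in a reusable function. It assumes every data line has six whitespace-separated columns and that job names contain no spaces; the names JOB_LINE and parse_jobs are made up for this illustration, and the pattern is a stricter variant of the one above, not the same regex.

```python
import re

# Assumed layout: "<id>.<host> <name> <user> <time> <state> <queue>",
# with columns separated by one or more whitespace characters.
JOB_LINE = re.compile(r'^(\d+)\.\S+\s+\S+\s+\S+\s+(\S+)\s+(\S)\s+(\S+)\s*$')


def parse_jobs(lines):
    """Return (job_id, time_used, state, queue) tuples for matching lines."""
    jobs = []
    for line in lines:
        match = JOB_LINE.match(line)
        if match:  # header and separator lines do not match and are skipped
            jobs.append(match.groups())
    return jobs


sample = [
    "Job id Name User Time Use S Queue",
    "------------------------- ---------------- --------------- -------- - -----",
    "48638.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q qp32",
    "48758.tyrone-cluster      meshA5   mecmdjim   45:00:40 R qp32",
]
for job in parse_jobs(sample):
    print(' '.join(job))
```

Because the function takes any iterable of lines, an open file object can be passed directly, e.g. parse_jobs(open('job.txt')). Using \S+ for each column makes the match independent of how many spaces pad the columns.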
Upvotes: 0
Reputation: 142176
A regex is a bit overkill here. You can use string splitting instead, with islice to skip the first two lines. From each remaining line, take everything up to the first ., then the last two words of the remainder, e.g.:
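from itertools import islice

with open('job.txt') as fin:
    # Skip the two header lines
    for line in islice(fin, 2, None):
        # The job number is everything before the first '.'
        num, _, rest = line.partition('.')
        # The last two whitespace-separated words are the state and the queue
        _, letter, code = rest.rsplit(None, 2)
        print(' '.join([num, letter, code]))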
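The question's desired output also includes the time field. A small variation of the same splitting approach could pick it up by splitting off three words from the right instead of two. This is only a sketch: the helper name extract_fields is invented here, and it assumes the time, state and queue are always the last three columns.

```python
def extract_fields(line):
    """Return (job_id, time_used, state, queue) from one data line."""
    # The job number is everything before the first '.'
    job_id, _, rest = line.partition('.')
    # maxsplit=3 peels off the last three whitespace-separated words
    _, time_used, state, queue = rest.rsplit(None, 3)
    return job_id, time_used, state, queue


line = "48758.tyrone-cluster meshA5 mecmdjim 45:00:40 R qp32\n"
print(' '.join(extract_fields(line)))
```

Splitting from the right means the name and user columns never need to be parsed, so odd job names such as ...01R-1850-01_2 do not matter.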
Upvotes: 1