reading lines from file into lists for a specific field

Question

I have the following text file, job.txt. From every line I want to extract a few fields into a list: the job number, e.g. 48638 (without the cluster name), the time field, and the state (Q).

Please guide me. I have tried this:

content = [x.strip('\n') for x in content]
stlist=content[2:]

to delete the first two lines, but I am not able to get the output shown below.

Each entry in the list should look like this:

48758 45:00:40 R qp32

job.txt is as follows:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
48638.tyrone-cluster             ...01R-1850-01_2 mcbkss                 0 Q qp32           
48738.tyrone-cluster             case3sqTS1e-4    mecvamsi        588:30:5 R qp32          
48758.tyrone-cluster             meshA5           mecmdjim        45:00:40 R qp32

EDIT: The file can also come in other formats, such as the one below, where the spacing between columns differs. The original file contains spaces like the listing above.

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - ----
48998.tyrone-cluster          gic1_nwgs                  mbupi           18:45:44           R             qp32           
48999.tyrone-cluster           gic2_nwgs           mbupi                  0 Q batch          
49005.tyrone-cluster        ...01R-1849-01_2 mcbkss          00:44:23 R qp32           
8687.tyrone-cluster        gaussian_top.sh  chemraja               0 Q qp32           
49047.tyrone-cluster        jet_egrid        asevelt         312:33:0 R qp128          
49052.tyrone-cluster        case3sqTS1e-4    mecvamsi               0 Q qp32           
49053.tyrone-cluster         ...01R-1850-01_1 mcbkss                 0 Q batch          
49054.tyrone-cluster        ...01R-1850-01_2 mcbkss                 0 Q batch

Since the format changes each time, can anyone help me write a generalized function that handles all these variations in the file?

Joe Young · Accepted Answer

You can parse the lines with a regular expression. Place the fields you want to display in capture groups by surrounding the relevant parts of the regular expression with parentheses, then pull those groups out with the group() method on the regex match result.

import re

# joblist stores one string of parsed fields per job line
joblist = []
# Groups: 1 = numeric job id, 2 = time used, 3 = state, 4 = queue
prog = re.compile(r'^(\d+)\..*\s+.*\s+\w+\s+(.*)\s+(\w)\s+(.*)$')
with open('job.txt', 'r') as jobfile:
    for line in jobfile:
        result = prog.match(line)
        # Skip the header and any other line that doesn't match the regex
        if result is None:
            continue
        joblist.append(' '.join(result.groups()))

# display the list
for job in joblist:
    print(job)

The data you provided:

macbook:Downloads joeyoung$ cat job.txt
Job id                    Name             User            Time Use S Queue
48638.cluster ...01R-1850-01_2 mcbkss 0 Q qp32
48738.cluster case3sqTS1e-4 mecvamsi 588:30:5 R qp32
48758.cluster meshA5 mecmdjim 45:00:40 R qp32
48638.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q qp32
48708.tyrone-cluster ...onwgs_entries mbupi 0 Q qp32
48736.tyrone-cluster ...01R-1850-01_1 mcbkss 0 Q batch
48737.tyrone-cluster ...01R-1850-01_2 mcbkss 0 Q batch

The output of the script on the above data (including the newly requested time field in column 2):

macbook:Downloads joeyoung$ python parsejob.py
48638 0 Q qp32
48738 588:30:5 R qp32
48758 45:00:40 R qp32
48638 0 Q qp32
48708 0 Q qp32
48736 0 Q batch
48737 0 Q batch

The parsed data is available in the joblist list variable.
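
Since the question asks for something that copes with changing column spacing, here is a minimal sketch of an alternative that splits each line on whitespace instead of using a regex. It is not part of the original answer: the function name parse_jobs is hypothetical, and it assumes that the time, state, and queue are always the last three whitespace-separated fields and that the job id always starts with digits followed by a dot.

```python
def parse_jobs(lines):
    """Return 'id time state queue' strings for each job line (hypothetical helper)."""
    jobs = []
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        # A job line has at least: job id, name, user, time, state, queue
        if len(fields) < 6:
            continue
        job_id = fields[0].split('.', 1)[0]  # drop the '.tyrone-cluster' part
        if not job_id.isdigit():
            continue  # header line or dashed separator
        # Assumption: the last three fields are time, state, queue
        time_used, state, queue = fields[-3:]
        jobs.append(' '.join([job_id, time_used, state, queue]))
    return jobs


sample = [
    "Job id                    Name             User            Time Use S Queue\n",
    "------------------------- ---------------- --------------- -------- - -----\n",
    "48638.tyrone-cluster             ...01R-1850-01_2 mcbkss                 0 Q qp32           \n",
    "48998.tyrone-cluster          gic1_nwgs                  mbupi           18:45:44           R             qp32           \n",
]
for job in parse_jobs(sample):
    print(job)
```

To run it on the real file, an open file object can be passed in directly, for example by calling parse_jobs(jobfile) inside a with open('job.txt') as jobfile: block. Because split() ignores how many spaces separate the columns and discards trailing whitespace, both layouts shown in the question should produce the same kind of output.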
