deltap

Reputation: 4206

Parsing a non-regular text file in pandas

I am trying to parse a text file that looks like this using pandas:

Some random text
more random text that may be of different length
JUNK 0 9 8
GOOD 0 1 1
GOOD 5 5 5
more random text interdispersed
GOOD 123 321 2
JUNK 55 1 9
GOOD 1 2 3

The file is space delimited. I only care about lines that start with 'GOOD', which will all have the same formatting.

I believe that read_table() is the right function, but I don't know how to make it skip the lines I don't want.

My current method of parsing files is to open the file, use a regex to match the lines I care about, and then split each line on spaces. This can be slow, and I am looking for a faster, cleaner way.

Upvotes: 2

Views: 1363

Answers (2)

akaRem

Reputation: 7618

Let's make a generator that filters the good lines:

def generate_good_lines(filename):
    with open(filename) as f:
        for line in f:  # iterate lazily, one line at a time
            if line.startswith('GOOD'):
                yield line

Now all you need is to parse these lines the way you want, e.g.:

def generate_parsed(filename_list):
    for filename in filename_list:
        for line in generate_good_lines(filename):
            data = your_parser_function(line)
            yield data

Then you consume all the lines into a list (for example):

your_list = list(generate_parsed(your_filename_list))

From your question, it looks like your_parser_function would be something like this:

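def your_parser_function(line):
    # skip the 'GOOD ' prefix, split the values and convert them to integers
    # (a list is returned so the result is the same in Python 2 and 3)
    return [int(value) for value in line[5:].split()]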

Generators keep memory consumption low, because lines are read and parsed one at a time rather than all being loaded up front.
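As a rough end-to-end check of the generator approach, the sketch below runs it against the sample from the question. It is only an illustration: it takes an iterable of lines instead of a filename so it can run without a file on disk, and the names SAMPLE, good_lines and parse_line are made up for this example.

```python
import io

# Sample data copied from the question.
SAMPLE = """\
Some random text
more random text that may be of different length
JUNK 0 9 8
GOOD 0 1 1
GOOD 5 5 5
more random text interdispersed
GOOD 123 321 2
JUNK 55 1 9
GOOD 1 2 3
"""


def good_lines(lines):
    """Lazily yield only the lines that start with 'GOOD'."""
    for line in lines:
        if line.startswith('GOOD'):
            yield line


def parse_line(line):
    """Drop the leading 'GOOD' token and convert the rest to integers."""
    return [int(value) for value in line.split()[1:]]


# io.StringIO stands in for an open file; a real file object iterates the same way.
rows = [parse_line(line) for line in good_lines(io.StringIO(SAMPLE))]
print(rows)  # [[0, 1, 1], [5, 5, 5], [123, 321, 2], [1, 2, 3]]
```

If a DataFrame is what you need in the end, the resulting list of rows can be passed to pandas.DataFrame(rows).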

/ sorry for my english /

Upvotes: 2

BrenBarn

Reputation: 251398

You don't need regex to match lines that start with "GOOD". Just iterate over the file and throw away all other lines, creating a "clean" copy of the data you want:

with open('irregular.txt') as inFile, open('regular.txt', 'w') as outFile:
    for line in inFile:
        if line.startswith('GOOD'):
            outFile.write(line)

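Then you can read "regular.txt" using read_table or read_csv with the delim_whitespace=True argument (in recent pandas versions, where delim_whitespace is deprecated, pass sep=r'\s+' instead). Since the file has no header row, also pass header=None so the first GOOD line is not taken as column names.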
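If you would rather not write an intermediate file, a possible variation is to collect the filtered lines in memory and hand them to pandas through io.StringIO. This is a minimal sketch, assuming the filtered data fits in memory; read_good_lines and SAMPLE are hypothetical names used only for illustration, and the sample is a subset of the data in the question.

```python
import io

import pandas as pd

# A subset of the sample data from the question.
SAMPLE = """\
Some random text
JUNK 0 9 8
GOOD 0 1 1
GOOD 5 5 5
more random text interdispersed
GOOD 123 321 2
JUNK 55 1 9
GOOD 1 2 3
"""


def read_good_lines(lines):
    """Keep only 'GOOD' lines and parse them as whitespace-delimited columns."""
    filtered = io.StringIO(
        ''.join(line for line in lines if line.startswith('GOOD'))
    )
    # header=None: the data has no header row.
    # sep=r'\s+': split on runs of whitespace.
    return pd.read_csv(filtered, sep=r'\s+', header=None)


# io.StringIO stands in for an open file here; with a real file you would use
# `with open('irregular.txt') as f: df = read_good_lines(f)`.
df = read_good_lines(io.StringIO(SAMPLE))
print(df)
```

Column 0 holds the 'GOOD' marker and columns 1 to 3 hold the integer values; drop column 0 with df.drop(columns=0) if you only want the numbers.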

Upvotes: 4
