Reputation: 4206
I am trying to parse a text file that looks like this using pandas:
Some random text
more random text that may be of different length
JUNK 0 9 8
GOOD 0 1 1
GOOD 5 5 5
more random text interdispersed
GOOD 123 321 2
JUNK 55 1 9
GOOD 1 2 3
The file is space delimited. I only care about lines that start with 'GOOD', which will all have the same formatting.
I believe that read_table() is the right command, but I don't know how to filter it.
My current method of parsing files is to open the file, use a regex to match the lines I care about, and then split each line on spaces. This can be slow, and I am looking for a faster, cleaner way.
Upvotes: 2
Views: 1363
Reputation: 7618
Let's make a generator that yields only the good lines:
def generate_good_lines(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('GOOD'):
                yield line
Now all you need is to parse these lines the way you want, e.g.:
def generate_parsed(filename_list):
    for filename in filename_list:
        for line in generate_good_lines(filename):
            data = your_parser_function(line)
            yield data
Then you can consume all the parsed lines into a list, for example:
your_list = list(generate_parsed(your_filename_list))
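Judging from your question, your_parser_function could look like this:
def your_parser_function(line):
    # Skip the 'GOOD ' prefix, split the values, and convert them to integers.
    # A list is built explicitly because map() is lazy in Python 3.
    return [int(value) for value in line[5:].split()]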
Generators keep memory use low because the lines are read and processed lazily, one at a time, instead of being loaded all at once.
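As a rough end-to-end sketch of the pipeline above, assuming Python 3 and the sample data from the question: the names parse_good_line, load_rows, and the temporary file are illustrative only and not part of the original answer.

```python
import os
import tempfile

# Sample input copied from the question.
SAMPLE = """Some random text
more random text that may be of different length
JUNK 0 9 8
GOOD 0 1 1
GOOD 5 5 5
more random text interdispersed
GOOD 123 321 2
JUNK 55 1 9
GOOD 1 2 3
"""


def generate_good_lines(filename):
    """Yield only the lines that start with 'GOOD'."""
    with open(filename) as f:
        for line in f:
            if line.startswith('GOOD'):
                yield line


def parse_good_line(line):
    """Drop the leading 'GOOD' token and convert the rest to integers."""
    return [int(value) for value in line.split()[1:]]


def load_rows(filename):
    """Collect the parsed rows of one file into a list (hypothetical helper)."""
    return [parse_good_line(line) for line in generate_good_lines(filename)]


# Write the sample to a temporary file so the sketch runs as-is.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(SAMPLE)
    sample_path = tmp.name

rows = load_rows(sample_path)
os.remove(sample_path)
print(rows)  # [[0, 1, 1], [5, 5, 5], [123, 321, 2], [1, 2, 3]]
```

Splitting on whitespace and discarding the first token, rather than slicing off a fixed five characters, is a small design choice that should tolerate extra spaces after 'GOOD'.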
Upvotes: 2
Reputation: 251398
You don't need regex to match lines that start with "GOOD". Just iterate over the file and throw away all other lines, creating a "clean" copy of the data you want:
with open('irregular.txt') as inFile, open('regular.txt', 'w') as outFile:
    for line in inFile:
        if line.startswith('GOOD'):
            outFile.write(line)
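Then you can read "regular.txt" using read_table or read_csv with the delim_whitespace=True argument (recent pandas releases deprecate that argument in favor of sep=r'\s+'). Since the file has no header row, you will probably also want header=None.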
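If the intermediate file is not needed, the same filtering can probably feed pandas directly through an in-memory buffer. The following is only a sketch, assuming a reasonably recent pandas: the function name read_good_rows and the column names tag, a, b, c are made up for illustration, and sep=r'\s+' stands in for delim_whitespace=True.

```python
import io

import pandas as pd

# Sample input copied from the question.
RAW_TEXT = """Some random text
more random text that may be of different length
JUNK 0 9 8
GOOD 0 1 1
GOOD 5 5 5
more random text interdispersed
GOOD 123 321 2
JUNK 55 1 9
GOOD 1 2 3
"""


def read_good_rows(lines):
    """Build a DataFrame from the lines that start with 'GOOD'.

    `lines` can be any iterable of text lines, e.g. an open file object.
    The column names are assumptions; the question does not name them.
    """
    good_text = ''.join(line for line in lines if line.startswith('GOOD'))
    frame = pd.read_csv(
        io.StringIO(good_text),
        sep=r'\s+',          # any run of whitespace separates fields
        header=None,         # the data has no header row
        names=['tag', 'a', 'b', 'c'],
    )
    # The first column only holds the literal 'GOOD' marker, so drop it.
    return frame.drop(columns='tag')


df = read_good_rows(io.StringIO(RAW_TEXT))
print(df)
```

With a real file, the call would look like `with open('irregular.txt') as f: df = read_good_rows(f)`, which avoids writing "regular.txt" at all.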
Upvotes: 4