Reputation:
I have a file with the following data:
<<row>>12|xyz|abc|2.34<</row>>
<<eof>>
The file may have several rows like this. I am trying to design a parser that parses each row in this file and returns an array of all rows. What would be the best way to do this? The code has to be written in Python. It should either skip rows that do not start with <<row>> or raise an error.
=======> UPDATE <========
I just found that a particular <<row>> can span multiple lines, so my code and the code in the answers below no longer work. Can someone please suggest an efficient solution?
The data files can contain hundreds to several thousands of rows.
Upvotes: 0
Views: 2667
Reputation: 969
A good practice is to test for unwanted cases and ignore them. Once you are sure that you have a compliant line, you process it. Note that the actual processing is not in an if statement. Without rows split across several lines, you need only two tests:
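rows = []
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        # Skip lines that do not open a row.
        if not line.startswith('<<row>>'):
            continue
        # Skip lines that do not close the row.
        if not line.endswith('<</row>>'):
            continue
        # Keep the text between the 7-character opening tag and the 8-character closing tag.
        row = line[7:-8]
        rows.append(row)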
With rows split across several lines, you need to save the previous line in some situations:
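rows = []
prev = None  # unfinished row carried over from earlier lines
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        # A line that does not open a row continues the unfinished one, if any.
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line
        if not line.startswith('<<row>>'):
            continue
        # Row not closed yet: remember it and read the next line.
        if not line.endswith('<</row>>'):
            prev = line
            continue
        row = line[7:-8]
        rows.append(row)
        prev = None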
If needed, you can split columns with: cols = row.split('|')
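The question also allows raising an error instead of skipping bad lines. Here is a minimal sketch of such a strict variant, not a definitive implementation. The function name parse_rows_strict and the constants are made up for illustration. It assumes that blank lines between rows are acceptable, that a line containing only <<eof>> ends the data, and that the pieces of a multi-line row are joined without a separator, as in the code above.

```python
ROW_OPEN = '<<row>>'
ROW_CLOSE = '<</row>>'
EOF_MARK = '<<eof>>'


def parse_rows_strict(lines):
    """Return a list of column lists; raise ValueError on malformed input."""
    rows = []
    pending = None  # text of a row that has been opened but not yet closed
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if pending is None:
            if not line:
                continue  # assumption: blank lines between rows are tolerated
            if line == EOF_MARK:
                break
            if not line.startswith(ROW_OPEN):
                raise ValueError(f'line {number}: expected {ROW_OPEN}, got {line!r}')
            pending = line
        else:
            if line.startswith(ROW_OPEN):
                raise ValueError(f'line {number}: previous row was never closed')
            pending += line  # continuation of a row split across lines
        if pending.endswith(ROW_CLOSE):
            rows.append(pending[len(ROW_OPEN):-len(ROW_CLOSE)].split('|'))
            pending = None
    if pending is not None:
        raise ValueError('input ended inside an unclosed row')
    return rows
```

Because it accepts any iterable of lines, you can pass an open file object directly, for example parse_rows_strict(open('newfile.txt')), and the file is still read one line at a time.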
Upvotes: 0
Reputation: 3797
A simple way without regular expressions:
output = []
with open('input.txt', 'r') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so the comparisons below can match
        if line == '<<eof>>':
            break
        elif not line.startswith('<<row>>'):
            continue
        else:
            output.append(line[7:-8].split('|'))
This uses every line starting with <<row>> until a line contains only <<eof>>.
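The loop above works line by line, so it does not cover rows that span several lines (see the update in the question). With only a few thousand rows, one possible alternative is to read the whole file into a string and search it, this time with a regular expression. The following is only a sketch under some assumptions: the name extract_rows is made up, everything after the first <<eof>> is ignored even if the marker is not on a line of its own, text outside row tags is silently skipped, and line breaks inside a row are dropped.

```python
import re

# Non-greedy match so each row stops at its own closing tag;
# DOTALL lets "." also match the newlines inside a multi-line row.
ROW_PATTERN = re.compile(r'<<row>>(.*?)<</row>>', re.DOTALL)


def extract_rows(text):
    """Return one list of column strings per row found before <<eof>>."""
    # Assumption: anything after the first <<eof>> marker is not data.
    body = text.partition('<<eof>>')[0]
    rows = []
    for match in ROW_PATTERN.finditer(body):
        # Join the physical lines of the row, stripping whitespace at the breaks.
        joined = ''.join(part.strip() for part in match.group(1).splitlines())
        rows.append(joined.split('|'))
    return rows
```

For a file you would call it as extract_rows(f.read()) inside a with open(...) block. The trade-off is that the whole file is held in memory at once.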
Upvotes: 1
Reputation: 59426
import re

def parseFile(fileName):
    with open(fileName) as f:
        def parseLine(line):
            # Expect: integer | word | word | decimal number, wrapped in row tags.
            m = re.match(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d\.]+)<</row>>$', line)
            if m:
                return m.groups()
        return [values for values in (
            parseLine(line)
            for line in f
            if line.startswith('<<row>>')) if values]
And? Am I different? ;-)
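Note that the groups captured by the pattern are all strings. If you want typed values, you could convert each tuple afterwards. This is just a sketch: Record, its field names and to_record are hypothetical, since the question does not say what the columns mean. Also, the last group is matched by [\d\.]+, which would accept something like 1.2.3, and float() raises ValueError on such input.

```python
from collections import namedtuple

# Hypothetical field names; the question does not say what the columns mean.
Record = namedtuple('Record', ['id', 'first', 'second', 'value'])


def to_record(values):
    """Convert one tuple of four matched strings into a typed Record."""
    number, first, second, value = values
    # int() and float() raise ValueError if the text is not a valid number.
    return Record(int(number), first, second, float(value))
```

With the function above you would use it as [to_record(v) for v in parseFile('input.txt')].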
Upvotes: 1