Reputation: 3976
I am getting stuck trying to figure out an efficient way to parse some plaintext that is structured with indents (from a word doc). Example (note: indentation below not rendering on mobile version of SO):
Attendance records 8 F 1921-2010 Box 2
1921-1927, 1932-1944
1937-1939,1948-1966,
1971-1979, 1989-1994, 2010
Number of meetings attended each year 1 F 1991-1994 Box 2
Papers re: Safaris 10 F 1951-2011 Box 2
Incomplete; Includes correspondence
about beginning “Safaris” may also
include announcements, invitations,
reports, attendance, and charges; some
photographs.
See also: Correspondence and Minutes
So the unindented text is the parent record data and each set of indented text below each parent data line are some notes for that data (which are also split into multiple lines themselves).
So far I have a crude script to parse out the unindented parent lines so that I get a list of dictionary items:
import re
f = open('example_text.txt', 'r')
lines = f.readlines()
records = []
for line in lines:
if line[0].isalpha():
processed = re.split('\s{2,}', line)
for i in processed:
title = processed[0]
rec_id = processed[1]
years = processed[2]
location = processed[3]
records.append({
"title": title,
"id": rec_id,
"years": years,
"location": location
})
elif not line[0].isalpha():
print "These are the notes, but attaching them to the above records is not clear"
print records`
and this produces:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010'},
{'id': '1 F',
'location': 'Box 2',
'title': 'Number of meetings attended each year',
'years': '1991-1994'},
{'id': '10 F',
'location': 'Box 2',
'title': 'Papers re: Safaris',
'years': '1951-2011'}]
But now I want to add to each record the notes to the effect of:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010',
'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010'
},
...]
What's confusing me is that I am assuming this procedural approach, line by line, and I'm not sure if there is a more Pythonic way to do this. I am more used to working with scraping webpages and with those at least you have selectors, here it's hard to double back going one by one down the line and I was hoping someone might be able to shake my thinking loose and provide a fresh view on a better way to attack this.
Update Just adding the condition suggested by answer below over the indented lines worked fine:
import re
import repr as _repr
from pprint import pprint
f = open('example_text.txt', 'r')
lines = f.readlines()
records = []
for line in lines:
if line[0].isalpha():
processed = re.split('\s{2,}', line)
#print processed
for i in processed:
title = processed[0]
rec_id = processed[1]
years = processed[2]
location = processed[3]
if not line[0].isalpha():
record['notes'].append(line)
continue
record = { "title": title,
"id": rec_id,
"years": years,
"location": location,
"notes": []}
records.append(record)
pprint(records)
Upvotes: 0
Views: 1034
Reputation: 32429
As you have already solved the parsing of the records, I will only focus on how to read the notes of each one:
records = []
with open('data.txt', 'r') as lines:
for line in lines:
if line.startswith ('\t'):
record ['notes'].append (line [1:])
continue
record = {'title': line, 'notes': [] }
records.append (record)
for record in records:
print ('Record is', record ['title'] )
print ('Notes are', record ['notes'] )
print ()
Upvotes: 1