roy
roy

Reputation: 3976

Parsing Text Structured with Indents in Python

I am getting stuck trying to figure out an efficient way to parse some plaintext that is structured with indents (from a word doc). Example (note: indentation below not rendering on mobile version of SO):

Attendance records 8 F 1921-2010 Box 2 1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010 Number of meetings attended each year 1 F 1991-1994 Box 2 Papers re: Safaris 10 F 1951-2011 Box 2 Incomplete; Includes correspondence about beginning “Safaris” may also include announcements, invitations, reports, attendance, and charges; some photographs. See also: Correspondence and Minutes

So the unindented text is the parent record data and each set of indented text below each parent data line are some notes for that data (which are also split into multiple lines themselves).

So far I have a crude script to parse out the unindented parent lines so that I get a list of dictionary items:

import re

f = open('example_text.txt', 'r')

lines = f.readlines()

records = []

for line in lines:

if line[0].isalpha():
        processed = re.split('\s{2,}', line)


        for i in processed:
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]

    records.append({
        "title": title,
        "id": rec_id,
        "years": years,
        "location": location
    })


elif not line[0].isalpha():

    print "These are the notes, but attaching them to the above records is not clear"


print records`

and this produces:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010'}, {'id': '1 F', 'location': 'Box 2', 'title': 'Number of meetings attended each year', 'years': '1991-1994'}, {'id': '10 F', 'location': 'Box 2', 'title': 'Papers re: Safaris', 'years': '1951-2011'}]

But now I want to add to each record the notes to the effect of:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010', 'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010' }, ...]

What's confusing me is that I am assuming this procedural approach, line by line, and I'm not sure if there is a more Pythonic way to do this. I am more used to working with scraping webpages and with those at least you have selectors, here it's hard to double back going one by one down the line and I was hoping someone might be able to shake my thinking loose and provide a fresh view on a better way to attack this.

Update Just adding the condition suggested by answer below over the indented lines worked fine:

import re
import repr as _repr
from pprint import pprint


f = open('example_text.txt', 'r')

lines = f.readlines()

records = []

for line in lines:

    if line[0].isalpha():
        processed = re.split('\s{2,}', line)

        #print processed

        for i in processed:
            title = processed[0]
            rec_id = processed[1]
            years = processed[2]
            location = processed[3]

    if not line[0].isalpha():


        record['notes'].append(line)
        continue

    record = { "title": title,
               "id": rec_id,
               "years": years,
               "location": location,
               "notes": []}

    records.append(record)





pprint(records)

Upvotes: 0

Views: 1034

Answers (1)

Hyperboreus
Hyperboreus

Reputation: 32429

As you have already solved the parsing of the records, I will only focus on how to read the notes of each one:

records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith ('\t'):
            record ['notes'].append (line [1:])
            continue
        record = {'title': line, 'notes': [] }
        records.append (record)

for record in records:
    print ('Record is', record ['title'] )
    print ('Notes are', record ['notes'] )
    print ()

Upvotes: 1

Related Questions