chapter3

Reputation: 984

Design a module to parse text file

I no longer believe in generic text file parsers, especially for files meant for human readers. Files like HTML and web logs are handled well by Beautiful Soup or regular expressions, but human-readable text files are still a tough nut to crack.

So I am willing to hand-code a text file parser, tailored to each format I encounter. I still want to see whether a better program structure is possible, so that I can still understand the program logic three months down the road and the code stays readable.

Today I was given a problem to extract the time-stamps from a file:

"As of 12:30:45, ..."
"Between 1:12:00 and 3:10:45, ..."
"During this time from 3:44:50 to 4:20:55 we have ..."

The parsing itself is straightforward: the time-stamps sit at different positions on each line. What I am thinking about is how to design the module/functions so that (1) each line format is handled separately, and (2) the program branches to the relevant function. For example, I can code each line parser like this:

def parse_as(s):
    # "As of 12:30:45, ..." has only one time stamp, so return it twice
    ts = s.split(' ')[2].rstrip(',')
    return ts, ts

def parse_between(s):
    # "Between 1:12:00 and 3:10:45, ..."
    parts = s.split(' ')
    return parts[1], parts[3].rstrip(',')

def parse_during(s):
    # "During this time from 3:44:50 to 4:20:55 we have ..."
    parts = s.split(' ')
    return parts[4], parts[6]

This gives me a quick overview of the formats the program already handles, and I can always add a new function when I encounter a new format.

However, I still don't have an elegant way to branch to the relevant function.

# open file
for l in f:
    s = l.split(' ')[0]  # the first word selects the format
    if s == 'As':
       ts1, ts2 = parse_as(l)
    else:
       if s == 'Between':
          ts1, ts2 = parse_between(l)
       else:
          if s == 'During':
             ts1, ts2 = parse_during(l)
          else:
             print('error!')
             continue  # nothing to process for an unknown format
    # process ts1 and ts2

That's not something I want to maintain.

Any suggestions? At one point I thought a decorator might help, but I couldn't work it out myself. I would appreciate it if anyone could point me in the right direction.

Upvotes: 4

Views: 468

Answers (3)

Steve Cohen

Reputation: 712

Why not use a regular expression?

import re

# open file
with open('datafile.txt') as f:
    for line in f:
        ts_vals = re.findall(r'(\d+:\d\d:\d\d)', line)
        # process ts_vals (one or two time-stamps per line)

Thus ts_vals will be a list with either one or two elements for the examples provided.
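
If the rest of the program expects a (ts1, ts2) pair as in the question, the list can be normalised in a small helper. This is only a sketch: the names TIMESTAMP and extract_timestamps are made up for illustration, and it assumes the first and last matches on a line are the ones wanted.

```python
import re

# Matches H:MM:SS or HH:MM:SS, e.g. 1:12:00 or 12:30:45.
TIMESTAMP = re.compile(r'\d{1,2}:\d\d:\d\d')

def extract_timestamps(line):
    """Return (ts1, ts2); ts2 repeats ts1 when the line has one time-stamp."""
    found = TIMESTAMP.findall(line)
    if not found:
        raise ValueError('no time-stamp in line: %r' % line)
    # Assumption: the first and last matches are the interesting ones.
    return found[0], found[-1]

sample = [
    'As of 12:30:45, ...',
    'Between 1:12:00 and 3:10:45, ...',
    'During this time from 3:44:50 to 4:20:55 we have ...',
]
for line in sample:
    print(extract_timestamps(line))
```

Because the pattern only captures the digits and colons, the trailing comma in "12:30:45," is dropped without any extra stripping.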

Upvotes: 0

Orelus

Reputation: 1023

What about

start_with = ["As", "Between", "During"]
parsers = [parse_as, parse_between, parse_during]

for l in f.readlines():
    match_found = False

    # use a name other than f so the file object is not shadowed
    for start, parser in zip(start_with, parsers):
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers take the whole line
            match_found = True
            break

    if not match_found:
        raise NotImplementedError('Not found!')

or with a dict as Ian mentioned:

rules = {
    "As": parse_as,
    "Between": parse_between,
    "During": parse_during
}

for l in f.readlines():
    match_found = False

    # use a name other than f so the file object is not shadowed
    for start, parser in rules.items():
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers take the whole line
            match_found = True
            break

    if not match_found:
        raise NotImplementedError('Not found!')
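
The match_found flag goes away if the lookup is moved into its own function. Below is a rough, self-contained sketch of that idea, run against the three sample lines from the question. RULES and parse_line are names made up here, and the parsers are simplified stand-ins whose word positions are assumed from those sample lines.

```python
def parse_as(s):
    ts = s.split(' ')[2].rstrip(',')
    return ts, ts  # single time-stamp, returned twice

def parse_between(s):
    parts = s.split(' ')
    return parts[1], parts[3].rstrip(',')

def parse_during(s):
    parts = s.split(' ')
    return parts[4], parts[6]

# Hypothetical rule table: line prefix -> parser function.
RULES = {'As': parse_as, 'Between': parse_between, 'During': parse_during}

def parse_line(line):
    """Return (ts1, ts2) from the first parser whose prefix matches."""
    for prefix, parser in RULES.items():
        if line.startswith(prefix):
            return parser(line)
    raise NotImplementedError('No parser for line: %r' % line)

for line in ['As of 12:30:45, ...',
             'Between 1:12:00 and 3:10:45, ...',
             'During this time from 3:44:50 to 4:20:55 we have ...']:
    print(parse_line(line))
```

Returning from inside the loop replaces both the flag and the break, and the raise is reached only when no prefix matched.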

Upvotes: 1

Ian

Reputation: 30813

Consider using a dictionary mapping:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}

Then you only need to use it like this:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}

for l in f:
    s = l.split(' ')[0]  # the first word is the lookup key
    p = dmap.get(s)
    if p is None:
        print('error')
    else:
        ts1, ts2 = p(l)
        # continue to process

This is a lot easier to maintain. If you have a new function, you just need to add it to dmap together with its keyword:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during,
    'After': parse_after,
    'Before': parse_before
    #and so on
}
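
Since the question mentions decorators: one possible way to avoid editing the dictionary by hand is to let each parser register itself. The following is only a sketch of that pattern. The names PARSERS, handles and parse_line are hypothetical, and the two parsers are simplified stand-ins whose word positions are assumed from the sample lines in the question.

```python
PARSERS = {}  # first word of a line -> parser function

def handles(keyword):
    """Decorator factory: register the decorated function under keyword."""
    def register(func):
        PARSERS[keyword] = func
        return func  # leave the function itself unchanged
    return register

@handles('As')
def parse_as(s):
    ts = s.split(' ')[2].rstrip(',')
    return ts, ts

@handles('Between')
def parse_between(s):
    parts = s.split(' ')
    return parts[1], parts[3].rstrip(',')

def parse_line(line):
    """Look up the parser by the first word of the line and call it."""
    keyword = line.split(' ', 1)[0]
    parser = PARSERS.get(keyword)
    if parser is None:
        raise ValueError('no parser registered for %r' % keyword)
    return parser(line)

print(parse_line('As of 12:30:45, ...'))
print(parse_line('Between 1:12:00 and 3:10:45, ...'))
```

With this layout the keyword sits right next to the function that handles it, so adding a format means adding one decorated function and nothing else.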

Upvotes: 3
