Regular Expressions in data parsing Python

Question

I'm relatively new to regular expressions and am amazed at how powerful they are. I have this project and was wondering if regular expressions would be appropriate and how to use them.

In this project I am given a file with a bunch of data. Here's a bit of it:

* File "miles.dat" from the Stanford GraphBase (C) 1993 Stanford University
* Revised mileage data for highways in the United States and Canada, 1949
* This file may be freely copied but please do not change it in any way!
* (Checksum parameters 696,295999341)

Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410

Each city line has the city name and state, then the latitude and longitude in brackets, then the population. The next line gives the distances from that city to each of the cities listed before it in the data. The data goes on for 180 cities.

My job is to create 4 lists: one for the cities, one for the coordinates, one for the populations, and one for the distances between cities. I know this is possible without regular expressions (I have written it), but the code is clunky and not as efficient as it could be. What do you think would be the best way to approach this?

Hugh Bothwell · Accepted Answer

I would recommend a regex for the city lines and a list comprehension for the distances (a regex would be overkill and slower as well).

Something like

import re

CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")
CITY_TYPES = (str, float, float, int)

def get_city(line):
    match = CITY_REG.match(line)
    if match:
        return [type(dat) for dat,type in zip(match.groups(), CITY_TYPES)]
    else:
        raise ValueError("Failed to parse {} as a city".format(line))

def get_distances(line):
    return [int(i) for i in line.split()]

then

>>> get_city("Youngstown, OH[4110.83,8065.14]115436")
['Youngstown, OH', 4110.83, 8065.14, 115436]

>>> get_distances("1513 2410")
[1513, 2410]

and you can use it like

# This code assumes Python 3.x
from itertools import zip_longest

def file_data_lines(fname, comment_start="* "):
    """
    Return lines of data from file
     (strip out blank lines and comment lines)
    """
    with open(fname) as inf:
        for line in inf:
            line = line.rstrip()
            if line and not line.startswith(comment_start):
                yield line

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

def city_data(fname):
    data = file_data_lines(fname)

    # city 0 has no distances line
    city_line = next(data)
    city, lat, lon, pop = get_city(city_line)
    yield city, (lat, lon), pop, []

    # all remaining cities
    for city_line, dist_line in grouper(data, 2, ''):
        city, lat, lon, pop = get_city(city_line)
        dists = get_distances(dist_line)
        yield city, (lat, lon), pop, dists

and finally

def main():
    # load per-city data
    city_info = list(city_data("miles.dat"))
    # transpose into separate lists
    cities, coords, pops, dists = list(zip(*city_info))

if __name__=="__main__":
    main()
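
To see what the transposition step produces, here is a small sketch that uses hand-typed records for the three sample cities from the question instead of reading miles.dat. The city_info literal is an assumption standing in for list(city_data("miles.dat")).

```python
# Hand-typed stand-in for list(city_data("miles.dat")); values are the
# three sample cities shown in the question.
city_info = [
    ("Youngstown, OH", (4110.0, 8065.0), 115436, []),
    ("Yankton, SD", (4288.0, 9739.0), 12011, [966]),
    ("Yakima, WA", (4660.0, 12051.0), 49826, [1513, 2410]),
]

# zip(*rows) turns a list of 4-field rows into 4 columns.
cities, coords, pops, dists = zip(*city_info)

# zip yields tuples; convert if real lists are required.
cities, coords, pops, dists = map(list, (cities, coords, pops, dists))
```

Each of the four names ends up with one entry per city, in file order, so cities[i], coords[i], pops[i] and dists[i] all describe the same city.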

Edit:

How it works:

CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")

[^[] matches any character except [; so ([^[]+) gets one or more characters up to (but not including) the first [; this gets "City Name, State", and returns it as the first group.

\[ matches a literal [ character; we have to escape it with a backslash to make it clear that we are not starting another character-group.

[0-9.] matches 0, 1, 2, 3, ... 9, or a period character. So ([0-9.]+) gets one or more digits or periods - ie any integer or floating-point number, not including an exponent - and returns it as the second group. This is under-constrained - it would accept something like 0.1.2.3, which is not a valid float and would make float() raise a ValueError - but an expression which only matched valid floats would be quite a bit more complicated, and this is sufficient for this purpose, assuming we will not run into anomalous input.
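
If you did want to reject input like 0.1.2.3 at the regex stage, one possible tightening is sketched below. STRICT_NUM and CITY_REG_STRICT are names I made up for illustration; the pattern allows digits with at most one decimal point and still does not handle signs or exponents.

```python
import re

# Hypothetical stricter number: "12", "12.", "12.5" or ".5",
# but never more than one decimal point.
STRICT_NUM = r"\d+(?:\.\d*)?|\.\d+"

CITY_REG_LOOSE = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")
CITY_REG_STRICT = re.compile(
    r"([^[]+)\[(" + STRICT_NUM + r"),(" + STRICT_NUM + r")\](\d+)"
)

good = "Youngstown, OH[4110.83,8065.14]115436"
bad = "Nowhere, XX[0.1.2.3,5]7"

loose_accepts_bad = CITY_REG_LOOSE.match(bad) is not None
strict_accepts_bad = CITY_REG_STRICT.match(bad) is not None
strict_groups = CITY_REG_STRICT.match(good).groups()
```

The loose pattern accepts the bad line and leaves the failure to float(), while the strict one refuses to match it at all. Which behaviour is preferable depends on where you would rather see the error.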

We get the comma, match another number as group 3, get the closing square-bracket; then \d matches any digit (same as [0-9]), so (\d+) matches one or more digits, ie an integer, and returns it as the fourth group.

match = CITY_REG.match(line)

We run the regular expression against a line of input; if it matches, we get back a Match object containing the matched data, otherwise we get None.
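
A quick sketch of both outcomes, using a city line and a distance line from the sample data. The variable names are just for illustration. Note that match() only anchors at the start of the string, so trailing text after the population is ignored; fullmatch() is the stricter alternative if that matters.

```python
import re

CITY_REG = re.compile(r"([^[]+)\[([0-9.]+),([0-9.]+)\](\d+)")

hit = CITY_REG.match("Yankton, SD[4288,9739]12011")   # a Match object
miss = CITY_REG.match("1513 2410")                    # None: no "[" anywhere

hit_groups = hit.groups()       # all four captured strings
first_group = hit.group(1)      # just the "City, State" part

# match() tolerates trailing junk; fullmatch() does not.
trailing = "Yankton, SD[4288,9739]12011 extra"
match_ok = CITY_REG.match(trailing) is not None
fullmatch_ok = CITY_REG.fullmatch(trailing) is not None
```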

if match:

... this is a short-form way of saying if bool(match) == True. An object is truthy by default (unless its class specifically overrides this, as empty lists and dicts do), and bool(None) is always False, so effectively "if the regular expression successfully matched the string:".
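
The truthiness rules can be checked directly. Plain and Sized below are throwaway classes invented for this sketch; Sized shows one way a class can override the default by defining __len__.

```python
class Plain:
    """No __bool__ or __len__, so instances are truthy by default."""
    pass

class Sized:
    """Defines __len__; a length of 0 makes instances falsy."""
    def __len__(self):
        return 0

results = [bool(Plain()), bool(Sized()), bool([]), bool({}), bool(None)]
```

Match objects behave like Plain here, which is why "if match:" works; writing "if match is not None:" is an equivalent, more explicit spelling.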

CITY_TYPES = (str, float, float, int)

Regular expressions only return strings; you want different data types, so we have to convert, which is what

[type(dat) for dat,type in zip(match.groups(), CITY_TYPES)]

does; match.groups() is the four pieces of matched data, and CITY_TYPES is the desired data-type for each, so zip(data, types) returns something like [("Youngstown, OH", str), ("4110.83", float), ("8065.14", float), ("115436", int)]. We then apply the data type to each piece, ending up with ["Youngstown, OH", 4110.83, 8065.14, 115436].
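
Here is the same conversion spelled out step by step on a hard-coded tuple of strings, standing in for what match.groups() returns. I have used convert rather than type as the loop variable to avoid shadowing the built-in; that renaming is my choice, not part of the code above.

```python
# Stand-in for match.groups(): everything is still a string.
groups = ("Youngstown, OH", "4110.83", "8065.14", "115436")
CITY_TYPES = (str, float, float, int)

# Pair each captured string with the callable that should convert it.
pairs = list(zip(groups, CITY_TYPES))

# Call each converter on its string.
converted = [convert(text) for text, convert in pairs]
```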

Hope that helps!
