Ryan James

Reputation: 175

Python - Parse text file and create lists based on some criteria

I've been searching for a solution to this question for a while without any luck. I'm wanting to use Python to read a text file and create some lists (or arrays) based on the data in the file. An example will best illustrate my goal.

Consider the following text:

NODE
1.0, 2.0
2.0, 2.0
3.0, 2.0
4.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
1, 2, 3, 4
1, 2, 3, 4
1, 2, 3, 4
5, 6, 7, 8
5, 6, 7, 8
5, 6, 7, 8

I would like to read through the file (ideally once, as the files can be large) and, once I find "NODE", take each line between "NODE" and "ELEMENT" and put it into a list. Then, once I reach "ELEMENT", take each line between "ELEMENT" and some other break (maybe another "ELEMENT" or end of file, etc…) and put that into a list. For this example, it would result in two lists.

I've tried various things but they all require knowing information about the file beforehand. I'd like to be able to automate it. Thank you very much!

Upvotes: 0

Views: 1521

Answers (4)

abarnert

Reputation: 365747

For the simpler problem in the updated question, you really don't need regexps, or groupby, or a complex state machine, or anything beyond what a novice should be able to understand easily.

All you need to do is accumulate rows into one list until you find the row 'ELEMENT', then start accumulating rows into the other one. Like this:

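import csv

result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']
# path is the name of the input file
with open(path) as f:
    # skipinitialspace drops the space that follows each comma
    for row in csv.reader(f, skipinitialspace=True):
        if not row or row == ['NODE']:
            pass  # blank line or the first header: nothing to store
        elif row == ['ELEMENT']:
            current = result['ELEMENTS']
        else:
            current.append(row)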
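
To try this without a file on disk, the same loop can be run against an in-memory copy of the data. This is only a sketch: SAMPLE and split_sections are names made up here for illustration, the data is a shortened version of the example in the question, and skipinitialspace=True is an addition so that the space after each comma is not kept in the values.

```python
import csv
import io

# Hypothetical in-memory stand-in for the input file (shortened sample data).
SAMPLE = """NODE
1.0, 2.0
2.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
"""

def split_sections(lines):
    """Accumulate rows into 'NODES' until the 'ELEMENT' row, then into 'ELEMENTS'."""
    result = {'NODES': [], 'ELEMENTS': []}
    current = result['NODES']
    for row in csv.reader(lines, skipinitialspace=True):
        if not row or row == ['NODE']:
            continue  # skip blank lines and the first header
        if row == ['ELEMENT']:
            current = result['ELEMENTS']
        else:
            current.append(row)
    return result

sections = split_sections(io.StringIO(SAMPLE))
print(sections['NODES'])     # each row is a list of strings
print(sections['ELEMENTS'])
```

The values stay as strings; converting them with float or int is a separate step.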

Upvotes: 0

dawg

Reputation: 103864

With that example data, and assuming that the labels follow the pattern in your example, you can use a regex:

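import re, mmap, os

def conv(s):
    try:
        return float(s)
    except ValueError:
        return s

data_dict = {}
# fn is the path to the input file. In Python 3 a mmap exposes bytes, so the
# file is opened in binary mode, the pattern is a bytes pattern, and each
# match is decoded to text before it is split.
with open(fn, 'rb') as fin:
    size = os.stat(fn).st_size
    data = mmap.mmap(fin.fileno(), size, access=mmap.ACCESS_READ)
    for m in re.finditer(rb'^(\w+)$([\d\s,.]+)', data, re.M):
        data_dict[m.group(1).decode()] = [[conv(e) for e in line.split(',')]
                        for line in m.group(2).decode().splitlines() if line.strip()]

print(data_dict)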

Prints:

{'NODE': [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]], 
 'ELEMENT': [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]]}

So, how does this work:

  1. We use mmap to apply a regex to a file
  2. We assume that the labels have the form ^\w+$ (i.e., a label made up of letters, digits, or underscores, alone on a line)
  3. Then capture all the numbers and spaces following that
  4. Create a dict with the label as the key and the numbers that follow it, parsed into lists of floats, as the value.

Done!
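
The same pattern can be tried on an in-memory string instead of a memory-mapped file, which makes the steps easier to follow. This is a sketch assuming Python 3 and a shortened copy of the sample data; text and parse_sections are illustrative names, not part of the code above.

```python
import re

# Hypothetical shortened copy of the sample data.
text = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"

def parse_sections(s):
    sections = {}
    # Group 1: a label alone on a line. Group 2: the digits, commas, dots
    # and whitespace that follow it, up to the next label.
    for m in re.finditer(r'^(\w+)$([\d\s,.]+)', s, re.M):
        rows = [[float(e) for e in line.split(',')]
                for line in m.group(2).splitlines() if line.strip()]
        sections[m.group(1)] = rows
    return sections

parsed = parse_sections(text)
print(parsed['NODE'])
print(parsed['ELEMENT'])
```

Because the result is a dict keyed by label, a label that appears more than once in the file would overwrite its earlier entry.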

Upvotes: 4

abarnert

Reputation: 365747

If you want this to be fully general and automated, you need to come up with the rule that distinguishes section headers from rows. I'll invent one, but it's probably not the one you want, in which case my invented code won't work for you… but hopefully it will show you what you need to do, and how to get started.

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

Now, we can just group the rows by whether or not they're section headers by using itertools.groupby. If you printed out each group, you'd get something like this:

True, [['NODE']]
False, [['1.0', '2.0'], ['2.0', '2.0'], …, ]
True, [['ELEMENT']]
False, [['1.0', '2.0', '3.0', '4.0'], …, ]
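
A loop along these lines would print output of that shape. It is only a sketch: rows is a hand-built stand-in for what csv.reader would yield, and describe_groups is a name invented here.

```python
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

# Hypothetical rows, standing in for csv.reader output.
rows = [['NODE'], ['1.0', '2.0'], ['2.0', '2.0'],
        ['ELEMENT'], ['1', '2', '3', '4']]

def describe_groups(rows):
    # Each group is a lazy iterator that is only valid until groupby
    # advances, so copy it into a list right away.
    return [(key, list(group))
            for key, group in itertools.groupby(rows, new_section)]

for key, group in describe_groups(rows):
    print(key, group)
```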

We don't care about the first value in each of those, so drop it.

And we want to batch up each pair of adjacent groups into a (header, rows) pair, which we can do by zipping our iterator with itself.
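
Zipping an iterator with itself works because both arguments to zip pull from the same underlying iterator, so each tuple takes two consecutive items. A small illustration with made-up data, where groups is just a list of groups that have already been copied into lists:

```python
# Hypothetical groups, alternating header group / rows group.
groups = [[['NODE']], [['1.0', '2.0']], [['ELEMENT']], [['1', '2', '3', '4']]]

it = iter(groups)
# zip takes one item for `header`, then the next one for `rows`,
# both from the same iterator.
pairs = list(zip(it, it))

for header, rows in pairs:
    print(header[0][0], rows)
```

If the number of groups is odd, zip silently drops the last one, and if the file does not start with a header the pairs come out misaligned, so this relies on the data alternating header, rows, header, rows.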

And then just put it in a dict, which will look something like this:

{'NODE': [['1.0', '2.0'], ['2.0', '2.0'], …],
 'ELEMENT': [['1.0', '2.0', '3.0', '4.0'], …]}

Here's the whole thing:

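import csv
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

with open(path) as f:
    # skipinitialspace drops the space that follows each comma
    rows = csv.reader(f, skipinitialspace=True)
    grouped = itertools.groupby(rows, new_section)
    # Copy each group into a list: groupby's groups are lazy iterators that
    # become invalid as soon as the next group is requested.
    groups = (list(group) for key, group in grouped)
    pairs = zip(groups, groups)
    lists = {header[0][0]: rows for header, rows in pairs}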

Upvotes: 2

Joran Beasley

Reputation: 113988

import re

def getBlocks(fname):
    # state: 0 = outside any block, 1 = reading NODE rows, 2 = reading ELEMENT rows
    state = 0
    node = []
    ele = []
    with open(fname) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            if "NODE" in line:
                if state == 2:
                    yield (node, ele)
                    node, ele = [], []
                state = 1
            elif state == 1 and "ELEMENT" in line:
                state = 2
            elif state == 1:
                node.append(list(map(float, line.split(","))))
            elif state == 2 and re.match("[a-zA-Z]+", line):
                # any other header ends the current block
                yield (node, ele)
                node, ele = [], []
                state = 0
            elif state == 2:
                ele.append(list(map(int, line.split(","))))
        yield (node, ele)

for node, ele in getBlocks("somefile.txt"):
    print("N:", node)
    print("E:", ele)

This might be about what you're looking for. It's kind of gross... I'm sure you can do it better.

Upvotes: 0
