TheMeaningfulEngineer

Reputation: 16339

Creating a complex data structure by parsing an output file

I'm looking for advice on how to build a data structure by parsing a file. This is the list I have in my file:

'01bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',
'01bpar( 3)=  0.00000000E+00',
'02epar( 1)=  0.49998963E+02',
'02epar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',
'02epar( 3)=  0.00000000E+00',
'02epar( 4)=  0.17862340E-01  half_life=  0.3880495E+02  relax_time=  0.5598371E+02',
'02bpar( 1)=  0.49998962E+02',
'02bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',

What I need to do is construct a data structure which should look like this:

http://img11.imageshack.us/img11/7645/datastructure.gif

(I couldn't post the image itself because of the new-user restriction.)

I've managed to write all the regexp filters to extract what is needed, but I can't work out how to construct the structure. Any ideas?

Upvotes: 0

Views: 227

Answers (3)

John Percival Hackworth

Reputation: 11531

Consider using a dict of dicts.

#!/usr/bin/env python
import re
import pprint
raw = """'01bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',
'01bpar( 3)=  0.00000000E+00',
'02epar( 1)=  0.49998963E+02',
'02epar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',
'02epar( 3)=  0.00000000E+00',
'02epar( 4)=  0.17862340E-01  half_life=  0.3880495E+02  relax_time=  0.5598371E+02',
'02bpar( 1)=  0.49998962E+02',
'02bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',"""

datastruct = {}
pattern = re.compile(r"""\D(?P<digits>\d+)(?P<field>[eb]par)[^=]+=\D+(?P<number>\d+\.\d+E[+-]\d+)""")
for line in raw.splitlines():
    result = pattern.search(line)
    parts = result.groupdict()
    if parts['digits'] not in datastruct:
        datastruct[parts['digits']] = {'epar':[], 'bpar':[]}
    datastruct[parts['digits']][parts['field']].append(parts['number'])

pprint.pprint(datastruct, depth=4)

Produces:

{'01': {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []},
 '02': {'bpar': ['0.49998962E+02', '0.23103878E-01'],
        'epar': ['0.49998963E+02',
                 '0.23103878E-01',
                 '0.00000000E+00',
                 '0.17862340E-01']}}

Revised version in light of comments:

from collections import defaultdict

pattern = re.compile(r"""\D(?P<digits>\d+)(?P<field>[eb]par)[^=]+=\D+(?P<number>\d+\.\d+E[+-]\d+)""")

# every new key starts out with empty 'epar' and 'bpar' lists
default = lambda: {'epar': [], 'bpar': []}
datastruct = defaultdict(default)

for line in raw.splitlines():
    result = pattern.search(line)
    parts = result.groupdict()
    datastruct[parts['digits']][parts['field']].append(parts['number'])

pprint.pprint(datastruct.items())

which produces:

[('02',
  {'bpar': ['0.49998962E+02', '0.23103878E-01'],
   'epar': ['0.49998963E+02',
            '0.23103878E-01',
            '0.00000000E+00',
            '0.17862340E-01']}),
 ('01', {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []})]
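
Both versions keep the captured values as strings. If numbers are wanted instead, Python's float() accepts this exponent notation directly. A minimal sketch, assuming the same line layout as in the question; the names LINE_RE and parse_lines are made up for this illustration and are not part of the code above:

```python
import re
from collections import defaultdict

# Hypothetical pattern for this sketch: two-digit prefix, epar/bpar,
# a parenthesised index, then the first number after the '='.
LINE_RE = re.compile(
    r"(?P<digits>\d+)(?P<field>[eb]par)\(\s*(?P<index>\d+)\)=\s*"
    r"(?P<number>\d+\.\d+E[+-]\d+)"
)

def parse_lines(lines):
    """Group the first value of each line by prefix and field, as floats."""
    result = defaultdict(lambda: {'epar': [], 'bpar': []})
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        value = float(match.group('number'))
        result[match.group('digits')][match.group('field')].append(value)
    return dict(result)

sample = [
    "'01bpar( 3)=  0.00000000E+00',",
    "'02epar( 1)=  0.49998963E+02',",
    "'02bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02',",
]
parsed = parse_lines(sample)
```

Lines that do not match are skipped silently here; raising an error instead may be the better choice if the file is expected to be clean.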

Upvotes: 1

PaulMcG

Reputation: 63709

It's theoretically possible to have pyparsing create the whole structure using parse actions, but if you just name the various fields as I have below, building up the structure yourself is not too bad. And if you want to convert this to regular expressions, this example should give you a start on how things might look:

source = """\
'01bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02', 
'01bpar( 3)=  0.00000000E+00', 
'02epar( 1)=  0.49998963E+02', 
'02epar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02', 
'02epar( 3)=  0.00000000E+00', 
'02epar( 4)=  0.17862340E-01  half_life=  0.3880495E+02  relax_time=  0.5598371E+02', 
'02bpar( 1)=  0.49998962E+02', 
'02bpar( 2)=  0.23103878E-01  half_life=  0.3000133E+02  relax_time=  0.4328278E+02', """

from pyparsing import Literal, Regex, Word, alphas, nums, oneOf, OneOrMore, quotedString, removeQuotes

EQ = Literal('=').suppress()
scinotationnum = Regex(r'\d\.\d+E[+-]\d+')
dataname = Word(alphas+'_')
key = Word(nums,exact=2) + oneOf("bpar epar")
index = '(' + Word(nums) + ')'

# a keyed value looks like "half_life=  0.3000133E+02"
keyedValue = dataname + EQ + scinotationnum

# define an item in the source - suppress values with keys, just want the unkeyed ones
item = key('key') + index + EQ + OneOrMore(keyedValue.suppress() | scinotationnum)('data')

# initialize summary structure
from collections import defaultdict
results = defaultdict(lambda : {'epar':[], 'bpar':[]})

# extract quoted strings from list
quotedString.setParseAction(removeQuotes)
for raw in quotedString.searchString(source):
    parts = item.parseString(raw[0])
    num,par = parts.key
    results[num][par].extend(parts.data)

# dump out results, or do whatever
from pprint import pprint
pprint(dict(results))

Prints:

{'01': {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []},
 '02': {'bpar': ['0.49998962E+02', '0.23103878E-01'],
        'epar': ['0.49998963E+02',
                 '0.23103878E-01',
                 '0.00000000E+00',
                 '0.17862340E-01']}}
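
The grammar above drops the keyed values (half_life, relax_time). If those are needed too, a plain-regex version could look roughly like the following. This is only a sketch under the assumption that every item has the layout shown in the question; HEAD_RE, KEYED_RE and parse_item are names invented for this example:

```python
import re

NUM = r"\d+\.\d+E[+-]\d+"
# Leading part of an item: "02epar( 4)=  0.17862340E-01"
HEAD_RE = re.compile(r"(\d{2})([eb]par)\(\s*(\d+)\)=\s*(" + NUM + ")")
# Optional trailing pairs: "half_life=  0.3880495E+02"
KEYED_RE = re.compile(r"([A-Za-z_]+)=\s*(" + NUM + ")")

def parse_item(text):
    """Return the fields of one item as a dict, or None if it does not match."""
    head = HEAD_RE.search(text)
    if head is None:
        return None
    num, par, index, value = head.groups()
    # Only look for keyed values after the leading part of the item.
    rest = text[head.end():]
    extras = {name: float(val) for name, val in KEYED_RE.findall(rest)}
    return {'num': num, 'par': par, 'index': int(index),
            'value': float(value), 'extras': extras}

item = parse_item(
    "02epar( 4)=  0.17862340E-01  half_life=  0.3880495E+02  relax_time=  0.5598371E+02"
)
```

For an item without keyed values, such as `01bpar( 3)=  0.00000000E+00`, the 'extras' dict is simply empty.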

Upvotes: 3

Spencer Rathbun

Reputation: 14900

Your top-level structure is positional, so a list is a natural choice. Since lists can hold arbitrary items, each entry can be a named tuple, and each field of the tuple can hold a list with its elements.

So, your code should look something like this pseudocode:

from collections import namedtuple

data = []
newTuple = namedtuple('stuff', ['epar', 'bpar'])
for line in theFile:
    # regexToGetThemFromString() stands in for your existing regex code
    eparVals = regexToGetThemFromString()
    bparVals = regexToGetThemFromString()
    t = newTuple(eparVals, bparVals)
    data.append(t)

You said you could already loop over the file and had the various regexes to get the data, so I didn't bother adding all the details.
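
For what it's worth, a runnable take on the list-of-named-tuples idea might look like this. It is a sketch, not a definitive implementation: the regex, the name build, and the mapping from the two-digit prefix to a list position (01 becomes index 0) are all assumptions made for the example, since the target structure is only shown in the linked image:

```python
import re
from collections import namedtuple

Stuff = namedtuple('Stuff', ['epar', 'bpar'])

# Hypothetical pattern: prefix, epar/bpar, index in parentheses, first number.
LINE_RE = re.compile(r"(\d+)([eb]par)\(\s*\d+\)=\s*(\d+\.\d+E[+-]\d+)")

def build(lines):
    """Return a list with one Stuff(epar, bpar) entry per numeric prefix."""
    data = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed layout
        # Assumption: prefixes start at 01, so 01 maps to list index 0.
        position = int(match.group(1)) - 1
        while len(data) <= position:
            data.append(Stuff(epar=[], bpar=[]))
        # The tuple itself is immutable, but the lists inside it are not.
        getattr(data[position], match.group(2)).append(float(match.group(3)))
    return data

data = build([
    "'01bpar( 3)=  0.00000000E+00',",
    "'02epar( 1)=  0.49998963E+02',",
    "'02bpar( 1)=  0.49998962E+02',",
])
```

After this, `data[1].epar` holds the epar values for prefix 02. If the prefixes are not contiguous or do not start at 01, a dict keyed by prefix is probably the safer container.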

Upvotes: 0
