Reputation: 33
I recently started using pyparsing and I'm stuck on the following: the data are organized in columns, the number of columns is not known in advance, and such a section can occur multiple times in the input. Please see the code below for an example.
# -*- coding: utf-8 -*-
from pyparsing import *
from decimal import Decimal
def convert_float(a):
    return Decimal(a[0].replace(',','.'))
def convert_int(a):
    return int(a[0])
NL = LineEnd().suppress()
dot = Literal('.')
dates = Combine(Word(nums,exact=2) + dot + Word(nums,exact=2) + dot + Word(nums,exact=4))
day_with_date = Word(alphas,exact=3).suppress() + dates
amount = ( Combine(OneOrMore(Word(nums)) + ',' + Word(nums),adjacent=False) +
Optional(Literal('EUR')).suppress() ).setParseAction(convert_float)
number = Word(nums).setParseAction(convert_int)
item_head = OneOrMore(Keyword('Item').suppress() + number)
item_det = Forward()
item_foot = Forward()
def defineColNumber(t):
    nbcols = len(t)#[0])
    item_det << Dict(Group(day_with_date('date') + Group(nbcols*amount)('data')))
    item_foot << Keyword('TOTAL').suppress() + Group(nbcols*amount)
sec = (item_head('it*').setParseAction(defineColNumber) +
Group(OneOrMore(item_det))('details*') +
item_foot('totals*'))
parser = OneOrMore(
sec
)
parser.ignore(NL)
out = """
Item 1 Item 2 Item 3
Sat 20.04.2013 3 126 375,00 EUR 115 297,00 EUR 67 830,00 EUR
Fri 19.04.2013 1 641 019,20 EUR 82 476,00 EUR 48 759,00 EUR
Thu 18.04.2013 548 481,10 EUR 46 383,00 EUR 29 810,00 EUR
Wed 17.04.2013 397 396,70 EUR 42 712,00 EUR 26 812,00 EUR
TOTAL 8 701 732,00 EUR 1 661 563,00 EUR 1 207 176,00 EUR
Item 4 Item 5
Sat 20.04.2013 126 375,00 EUR 215 297,00 EUR
Fri 19.04.2013 2 641 019,20 EUR 32 476,00 EUR
Thu 18.04.2013 548 481,10 EUR 56 383,00 EUR
Wed 17.04.2013 397 396,70 EUR 42 712,00 EUR
TOTAL 2 701 732,00 EUR 1 663 563,00 EUR
"""
p = parser.parseString(out, parseAll=True)
print p.dump()
print p.it
print p.details[0]['18.04.2013'].data[2]
print p.totals
Currently, for example, p.it looks like [[1, 2, 3], [4, 5]].
What I need is [1, 2, 3, 4, 5], and likewise for the other parts, so that instead of p.details[0]['18.04.2013'].data[2] I could write p.details['18.04.2013'].data[2].
I'm out of ideas. Is it possible to join the results in some easy way, or do I need to modify the ParseResults with some other function?
Thanks for any help.
BTW, does this code make sense with regard to parsing dates, amounts, etc.?
Upvotes: 2
Views: 404
Reputation: 63709
This kind of parsing of tabular data is one of the original cases that pyparsing was written for. Congratulations on getting this far with parsing a non-trivial input text!
Rather than try to do any unnatural Grouping or whatnot to twist or combine the parsed data into your desired data structure, I'd just walk the parsed results as you've got them and build up a new summary structure, which I'll call summary. We are actually going to accumulate data into this dict, which strongly suggests using a defaultdict for simplified initialization of the summary when a new key is found.
from collections import defaultdict
summary = defaultdict(dict)
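To illustrate why defaultdict(dict) fits here, a minimal sketch with made-up values (the name demo and the values are for illustration only, not taken from the parsed input): the first access to a missing key creates an empty dict, so update can be called without checking whether the key already exists.

```python
from collections import defaultdict

# Illustrative values only; the keys mimic the date strings in the input.
demo = defaultdict(dict)

# No "if key not in demo" check is needed: a missing key is
# initialised to an empty dict on first access.
demo['18.04.2013'].update({1: 'a', 2: 'b'})
demo['18.04.2013'].update({4: 'c'})   # a later section with the same date
demo['17.04.2013'].update({1: 'd'})

merged_entry = demo['18.04.2013']
```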
Looking at the current structure returned in p, you are getting item headers and detailed data sets gathered into the named results it and details. We can zip these together to get each section's headers and data. Then for each line in the details, we'll make a dict of the detailed values by zipping the item headers with the parsed data values. Then we'll update the summary value that is keyed by line.date:
for items,details in zip(p.it,p.details):
    for line in details:
        summary[line.date[0]].update(dict(zip(items,line.data)))
Done! See what the keys are that we have accumulated:
print summary.keys()
gives:
['20.04.2013', '18.04.2013', '17.04.2013', '19.04.2013']
Print the data accumulated for '18.04.2013':
print summary['18.04.2013']
gives:
{1: Decimal('548481.10'), 2: Decimal('46383.00'), 3: Decimal('29810.00'), 4: Decimal('548481.10'), 5: Decimal('56383.00')}
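The same merge can be tried without pyparsing. The sketch below uses plain lists and tuples as hypothetical stand-ins for p.it and p.details (the names sections_items, sections_details, all_items and merged are made up for illustration, and only two dates from the sample input are included). It also flattens the item headers into the single list asked for in the question.

```python
from collections import defaultdict
from decimal import Decimal
from itertools import chain

# Hypothetical stand-ins for p.it and p.details: one entry per section.
sections_items = [[1, 2, 3], [4, 5]]
sections_details = [
    [('18.04.2013', [Decimal('548481.10'), Decimal('46383.00'), Decimal('29810.00')]),
     ('17.04.2013', [Decimal('397396.70'), Decimal('42712.00'), Decimal('26812.00')])],
    [('18.04.2013', [Decimal('548481.10'), Decimal('56383.00')]),
     ('17.04.2013', [Decimal('397396.70'), Decimal('42712.00')])],
]

# Flatten [[1, 2, 3], [4, 5]] into [1, 2, 3, 4, 5].
all_items = list(chain.from_iterable(sections_items))

# Merge every section's columns into one dict per date,
# keyed by item number.
merged = defaultdict(dict)
for items, details in zip(sections_items, sections_details):
    for date, data in details:
        merged[date].update(zip(items, data))
```

With this shape, merged['18.04.2013'][3] plays the role of the p.details['18.04.2013'].data[2] lookup from the question, keyed by item number rather than by column position.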
Upvotes: 1