Reputation: 313
I have a group of 500-600 files that I want to search through and extract data from. I'm trying to use pyparsing, with very limited success. There are only 3 things in a file: (1) comments, (2) simple assignments and (3) nested assignments. The nesting goes about 6 levels deep.
My goal is to check a field at the third nesting level and, if it has a particular value, pull out the value of another third-level field that belongs to the same second-level field.
First, is pyparsing the right tool for this? If not, what would you recommend instead?
I know how to build a list of files and iterate over them. Let me show a sample file and then the code I'm trying.
# TOP_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TOP_OBJECT=
(
obj_fmt=
(
obj_name="foo"
obj_cre_date=737785182 # = Tue May 18 23:19:42 1993
opj_data=
(
a="continue"
b="quit"
)
obj_version=264192 # = Version 4.8.0
)
# LEVEL1_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LEVEL1_OBJECT=
(
OBJ_part=
(
obj_type=1005
obj_size=120
)
# LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LEVEL2_OBJECT_A=
(
OBJ_part=
(
obj_type=3001
obj_size=128
)
Another_part=
(
another_attr=
(
another_style=0
another_param=2
)
)
) ### End of LEVEL2_OBJECT_A ###
# LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LEVEL2_OBJECT_B=
(
OBJ_part=
(
obj_type=3005
obj_size=128
)
Another_part=
(
another_attr=
(
another_style=0
another_param=8
)
)
) ### End of LEVEL2_OBJECT_B ###
) ### End of LEVEL1 OBJECT
) ### End of TOP_OBJECT ###
My code to digest the file looks like this:
from pyparsing import *

def Syntax():
    comment = Group("#" + restOfLine).suppress()
    eq = Literal('=')
    lpar = Literal('(').suppress()
    rpar = Literal(')').suppress()
    num = Word(nums)
    var = Word(alphas + "_")
    simpleAssign = var + eq
    nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar)
    expr = Forward()
    atom = nestedAssign | simpleAssign
    expr << atom
    expr.ignore(comment)
    return expr

def main():
    expr = Syntax()
    results = expr.parseFile("for_show.asc")
    print(results)

if __name__ == '__main__':
    main()
My results don't descend into the nested structure; all I get is ['TOP_OBJECT', '=']
Right now I'm not handling quoted strings or numbers. I'm just trying to understand how to parse nested lists.
Upvotes: 1
Reputation: 63762
You're mostly there; there are just a few gaps in your parser. Compare the commented-out original lines with the lines that replace them:
def Syntax():
    comment = Group("#" + restOfLine).suppress()
    eq = Literal('=')
    lpar = Literal('(').suppress()
    rpar = Literal(')').suppress()
    num = Word(nums)
    # names may contain digits after the first character (e.g. LEVEL1_OBJECT)
    #~ var = Word(alphas + "_")
    var = Word(alphas + "_", alphanums + "_")
    # a simple assignment needs a right-hand side: a number or a quoted string
    #~ simpleAssign = var + eq
    expr = Forward()
    simpleAssign = var + eq + (num | quotedString)
    # a nested assignment is "name = ( ... )" and its body recurses through expr
    #~ nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar)
    nestedAssign = var + eq + Group(lpar + OneOrMore(expr) + rpar)
    atom = nestedAssign | simpleAssign
    expr << atom
    expr.ignore(comment)
    return expr
This gives:
['TOP_OBJECT',
'=',
['obj_fmt',
'=',
['obj_name',
'=',
'"foo"',
'obj_cre_date',
'=',
'737785182',
'opj_data',
'=',
['a', '=', '"continue"', 'b', '=', '"quit"'],
'obj_version',
'=',
'264192'],
'LEVEL1_OBJECT',
'=',
['OBJ_part',
'=',
['obj_type', '=', '1005', 'obj_size', '=', '120'],
'LEVEL2_OBJECT_A',
'=',
['OBJ_part',
'=',
['obj_type', '=', '3001', 'obj_size', '=', '128'],
'Another_part',
'=',
['another_attr',
'=',
['another_style', '=', '0', 'another_param', '=', '2']]],
'LEVEL2_OBJECT_B',
'=',
['OBJ_part',
'=',
['obj_type', '=', '3005', 'obj_size', '=', '128'],
'Another_part',
'=',
['another_attr',
'=',
['another_style', '=', '0', 'another_param', '=', '8']]]]]]
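As for why the original grammar stopped at ['TOP_OBJECT', '=']: simpleAssign matched only `var + eq`, and by default pyparsing returns as soon as the expression has matched, without insisting that the rest of the input be consumed. Here is a minimal sketch of that, assuming a pyparsing version whose `parseString` accepts `parseAll=True`. The `SAMPLE` text is a cut-down, made-up fragment in the shape of your file, and `parses_completely` is a hypothetical helper name, not part of pyparsing.

```python
from pyparsing import (Forward, Group, Literal, OneOrMore, ParseException,
                       Word, alphanums, alphas, nums, quotedString, restOfLine)

# Cut-down, made-up sample in the same shape as the file in the question.
SAMPLE = """
# TOP_OBJECT ~~~~~~~~
TOP_OBJECT=
(
    obj_name="foo"
    opj_data=
    (
        a="continue"
        b="quit"
    )
    obj_version=264192 # = Version 4.8.0
)
"""


def original_syntax():
    """Grammar as posted in the question (gaps kept on purpose)."""
    comment = Group("#" + restOfLine).suppress()
    eq = Literal("=")
    lpar = Literal("(").suppress()
    rpar = Literal(")").suppress()
    var = Word(alphas + "_")
    simple_assign = var + eq
    nested_assign = Group(lpar + OneOrMore(simple_assign) + rpar)
    expr = Forward()
    expr <<= nested_assign | simple_assign
    expr.ignore(comment)
    return expr


def fixed_syntax():
    """Grammar with the three changes described above."""
    comment = Group("#" + restOfLine).suppress()
    eq = Literal("=")
    lpar = Literal("(").suppress()
    rpar = Literal(")").suppress()
    num = Word(nums)
    var = Word(alphas + "_", alphanums + "_")
    expr = Forward()
    simple_assign = var + eq + (num | quotedString)
    nested_assign = var + eq + Group(lpar + OneOrMore(expr) + rpar)
    expr <<= nested_assign | simple_assign
    expr.ignore(comment)
    return expr


def parses_completely(grammar, text):
    """Hypothetical helper: True if the grammar consumes the whole text."""
    try:
        grammar.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


# Without parseAll, the original grammar quietly stops after "TOP_OBJECT=".
first = original_syntax().parseString(SAMPLE).asList()
print(first)
print(parses_completely(original_syntax(), SAMPLE))
print(parses_completely(fixed_syntax(), SAMPLE))
```

Passing `parseAll=True` while developing a grammar turns this kind of silent early stop into a ParseException that points at the first unparsed location.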
If you wrap the expr inside nestedAssign's OneOrMore with Group:
    nestedAssign = var + eq + Group(lpar + OneOrMore(Group(expr)) + rpar)
I think you'll get better structure for your repeated nested assignments:
['TOP_OBJECT',
'=',
[['obj_fmt',
'=',
[['obj_name', '=', '"foo"'],
['obj_cre_date', '=', '737785182'],
['opj_data', '=', [['a', '=', '"continue"'], ['b', '=', '"quit"']]],
['obj_version', '=', '264192']]],
['LEVEL1_OBJECT',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '1005'], ['obj_size', '=', '120']]],
['LEVEL2_OBJECT_A',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '3001'], ['obj_size', '=', '128']]],
['Another_part',
'=',
[['another_attr',
'=',
[['another_style', '=', '0'], ['another_param', '=', '2']]]]]]],
['LEVEL2_OBJECT_B',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '3005'], ['obj_size', '=', '128']]],
['Another_part',
'=',
[['another_attr',
'=',
[['another_style', '=', '0'], ['another_param', '=', '8']]]]]]]]]]]
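With that grouping, every assignment is a three-item list `[name, '=', value]`, so your original goal (check one third-level field and pull a sibling value out of the same second-level object) becomes fairly mechanical. Below is a minimal sketch, assuming the parse results have been turned into plain lists with `results.asList()`. The helper names `to_dict` and `find_param` are hypothetical, and picking `obj_type` and `another_param` as the fields of interest is only my guess at what you are after.

```python
def to_dict(assignments):
    """Turn a list of [name, '=', value] groups into a nested dict."""
    out = {}
    for name, _eq, value in assignments:
        # A list value is a nested block holding further assignments.
        out[name] = to_dict(value) if isinstance(value, list) else value
    return out


def find_param(level1, wanted_type):
    """Yield (object_name, another_param) for matching level-2 objects."""
    for name, obj in level1.items():
        if not isinstance(obj, dict):
            continue
        # Entries without a nested OBJ_part (such as OBJ_part itself) fall
        # through here because the lookup yields None.
        if obj.get("OBJ_part", {}).get("obj_type") == wanted_type:
            attr = obj.get("Another_part", {}).get("another_attr", {})
            yield name, attr.get("another_param")


# Hand-copied fragment of the grouped output shown above; with pyparsing
# this list would come from results.asList() instead.
parsed = [
    'LEVEL1_OBJECT', '=', [
        ['OBJ_part', '=', [['obj_type', '=', '1005'], ['obj_size', '=', '120']]],
        ['LEVEL2_OBJECT_B', '=', [
            ['OBJ_part', '=', [['obj_type', '=', '3005'], ['obj_size', '=', '128']]],
            ['Another_part', '=', [
                ['another_attr', '=', [
                    ['another_style', '=', '0'],
                    ['another_param', '=', '8'],
                ]],
            ]],
        ]],
    ],
]

# The outermost assignment is not wrapped in a Group, so wrap it once here.
level1 = to_dict([parsed])["LEVEL1_OBJECT"]
matches = list(find_param(level1, "3005"))
print(matches)
```

Note that the values stay strings because the grammar does not convert numbers, and that a dict keeps only the last of any repeated names within one block. If your files repeat names at the same level, keep the lists instead.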
Also, your originally posted code contained TABs. I find them to be more trouble than they are worth; you are better off using 4-space indents.
Upvotes: 1