ivand58

Reputation: 811

Not so simple recursive descent in PyParsing

I need help with the following test case (pattern), which needs to be parsed in Python:

IO_SET(BLOCK, key1, value1, key2, value2, ... ,keyn, valuen);

where BLOCK and each key are identifiers, and each value is an identifier (macro definition), a number, a function call, or a numeric expression.

I can split it relatively easily (even with repeated RE groups), except when a value is a function call, e.g.:

IO_SET(BLOCK, key1, function(1+2,3, val11), key2, val2, key3, (3U)+cVAL3); 

P.S. Zero or more spaces are allowed around the parentheses, semicolons, and commas.

This can probably be done with pyparsing, but I've run into many problems. For example, with value = Word(nums), the word "1a23" is parsed as value = "1".

Upvotes: 1

Views: 116

Answers (1)

PaulMcG

Reputation: 63709

Here is a parser for your sample. You have to define a recursive grammar (using a pyparsing Forward), since a function call can have arguments that are themselves function calls:

sample = """IO_SET(BLOCK, key1, function(1+2,3, val11), key2, val2, key3, (3U)+cVAL3);"""


from pyparsing import *

SEMI,LPAREN,RPAREN = map(Suppress,";()")
identifier = Combine(Optional(Word(nums+'_')) + Word(alphas, alphanums+'_'))
integer= Combine(Optional('-') + Word(nums))
realnum = Combine(integer.copy() + '.' + Optional(Word(nums)))

fn_call = Forward()
# this order is *critical*
value = realnum | fn_call | identifier | integer

expr = infixNotation(value,
            [
            (oneOf('* /'), 2, opAssoc.LEFT),
            (oneOf('+ -'), 2, opAssoc.LEFT),
            ])
fn_call <<= Group(identifier + LPAREN + Group(Optional(delimitedList(expr))) + RPAREN)


print(value.parseString(sample).asList())

Prints:

[['IO_SET', ['BLOCK', 'key1', ['function', [['1', '+', '2'], '3', 'val11']], 
            'key2', 'val2', 'key3', ['3U', '+', 'cVAL3']]]]

As indicated in the comment, the order of expressions in value is critical. Since this language supports identifiers that can start with a numeric character, you have to test for an identifier before testing for an integer (otherwise the leading digits will be parsed as an integer and the rest of the string will be left hanging).
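To make the ordering issue concrete, here is a minimal sketch that reuses the identifier and integer definitions from above. The names wrong_order and right_order are only illustrative and are not part of the parser itself:

```python
from pyparsing import Combine, Optional, Word, alphas, alphanums, nums

# Same definitions as in the parser above.
identifier = Combine(Optional(Word(nums + '_')) + Word(alphas, alphanums + '_'))
integer = Combine(Optional('-') + Word(nums))

# MatchFirst ('|') stops at the first alternative that matches.
wrong_order = integer | identifier   # integer grabs the leading digit
right_order = identifier | integer   # identifier gets the first chance

print(wrong_order.parseString("1a23").asList())  # ['1'], "a23" is left unparsed
print(right_order.parseString("1a23").asList())  # ['1a23']
print(right_order.parseString("123").asList())   # ['123'], falls through to integer
```

parseString does not require the whole input to be consumed unless parseAll=True is passed, which is why the first call succeeds while silently leaving "a23" behind.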

You could try some alternatives to relying on this ordering:

  • use the Or operator ('^') instead of MatchFirst ('|'), which will try all possible alternatives and choose the longest match (can infinitely recurse in recursive grammars like this one)

  • force integers to be followed by a word break (using pyparsing's WordEnd() class)

HTH

EDIT

Here is an updated version, using your clarified definitions. Since you had a clear regex for your integer form, it is easiest to just use the pyparsing Regex class; with this change, I reverted identifier to a more conventional form. I also added key-value structure to the function arguments, but had to define two varieties of function call, since the function call used as an argument does not conform to the structured key-value argument list. Using the new pprint method makes it easier to see the structure of your arg list.

sample = """IO_SET(BLOCK, key1, function(1+2,3, val11), key2, val2, key3, (3U)+cVAL3);"""

from pyparsing import *

SEMI,LPAREN,RPAREN,COMMA = map(Suppress,";(),")
#identifier = Combine(Optional(Word(nums+'_')) + Word(alphas, alphanums+'_'))
identifier = Word(alphas, alphanums+'_')
#integer= Combine(Optional('-') + Word(nums))
integer = Regex(r"[+-]?\d+[Uu]?[Ll]?")
realnum = Combine(integer.copy() + '.' + Optional(Word(nums)))

fn_call1 = Forward()
fn_call2 = Forward()
# this order is *critical*
value = realnum | fn_call1 | fn_call2 | identifier | integer

expr = infixNotation(value,
            [
            (oneOf('* /'), 2, opAssoc.LEFT),
            (oneOf('+ -'), 2, opAssoc.LEFT),
            ])
key_value = Group(identifier + COMMA + expr)
kv_args = identifier + Optional(COMMA + delimitedList(key_value))
fn_call1 <<= Group(identifier + LPAREN + Group(kv_args) + RPAREN)
simple_args = Optional(delimitedList(expr))
fn_call2 <<= Group(identifier + LPAREN + Group(simple_args) + RPAREN)

value.parseString(sample).pprint()

Prints:

[['IO_SET',
  ['BLOCK',
   ['key1', ['function', [['1', '+', '2'], '3', 'val11']]],
   ['key2', 'val2'],
   ['key3', ['3U', '+', 'cVAL3']]]]]
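If you want to check which literals the integer pattern accepts, it can be exercised on its own with the standard re module. This is only a sketch; the helper names is_integer_literal and leading_integer are hypothetical. Keep in mind that pyparsing's Regex matches at the current parse position in the same way re.match does, so it accepts a matching prefix rather than requiring a whole word:

```python
import re

# Same pattern as the integer expression in the updated parser.
INTEGER_RE = re.compile(r"[+-]?\d+[Uu]?[Ll]?")

def is_integer_literal(text):
    """True if the whole string is an integer literal."""
    return INTEGER_RE.fullmatch(text) is not None

def leading_integer(text):
    """Return the integer prefix a prefix-style match would consume, or None."""
    m = INTEGER_RE.match(text)
    return m.group() if m else None

for text in ("3U", "-42", "10ul", "1a23"):
    print(text, is_integer_literal(text), leading_integer(text))
```

For "1a23" the prefix match still yields "1", so the word-break guard mentioned earlier would still be needed if such tokens can occur in your input.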

Upvotes: 2
