Reputation: 811
I need help parsing the following pattern in Python:
IO_SET(BLOCK, key1, value1, key2, value2, ... ,keyn, valuen);
where BLOCK and each key are identifiers, and each value is an identifier (macro definition), a number, a function call, or a numeric expression.
I can split it relatively easily (even with repeated RE groups), except when a value is a function call, e.g.
IO_SET(BLOCK, key1, function(1+2,3, val11), key2, val2, key3, (3U)+cVAL3);
P.S. Zero or more spaces are allowed around the parentheses, semicolons and commas.
This can probably be done with pyparsing, but I've run into many problems. For example, with
value = Word(nums)
the word "1a23" is parsed as value = "1" instead of being rejected.
Upvotes: 1
Views: 116
Reputation: 63709
Here is a parser for your sample. You have to define a recursive grammar (using a pyparsing Forward), since a function call can have arguments that are themselves function calls:
sample = """IO_SET(BLOCK, key1, function(1+2,3, val11), key2, val2, key3, (3U)+cVAL3);"""
from pyparsing import *
SEMI,LPAREN,RPAREN = map(Suppress,";()")
identifier = Combine(Optional(Word(nums+'_')) + Word(alphas, alphanums+'_'))
integer= Combine(Optional('-') + Word(nums))
realnum = Combine(integer.copy() + '.' + Optional(Word(nums)))
fn_call = Forward()
# this order is *critical*
value = realnum | fn_call | identifier | integer
expr = infixNotation(value,
[
(oneOf('* /'), 2, opAssoc.LEFT),
(oneOf('+ -'), 2, opAssoc.LEFT),
])
fn_call <<= Group(identifier + LPAREN + Group(Optional(delimitedList(expr))) + RPAREN)
print(value.parseString(sample).asList())
Prints:
[['IO_SET', ['BLOCK', 'key1', ['function', [['1', '+', '2'], '3', 'val11']],
'key2', 'val2', 'key3', ['3U', '+', 'cVAL3']]]]
As indicated in the comment, the order of expressions in value is critical. Since this language supports identifiers that can start with a numeric character, you have to test for an identifier before testing for an integer (otherwise the leading digit will be interpreted as an integer and the rest of the string will be left hanging).
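To see the ordering problem in isolation, here is a minimal sketch that reuses only the identifier and integer definitions from above. The names good_order and bad_order are made up for this illustration, and parseString is called without parseAll, so leftover input is silently ignored:

```python
from pyparsing import Combine, Optional, Word, alphas, alphanums, nums

# Same definitions as in the grammar above
identifier = Combine(Optional(Word(nums + '_')) + Word(alphas, alphanums + '_'))
integer = Combine(Optional('-') + Word(nums))

# Hypothetical names, only for this comparison
good_order = identifier | integer   # identifier is tried first
bad_order = integer | identifier    # integer is tried first

good_result = good_order.parseString("3U").asList()
bad_result = bad_order.parseString("3U").asList()

print(good_result)  # ['3U']
print(bad_result)   # ['3'] ; the trailing 'U' is left unparsed
```

With the bad ordering, the dangling 'U' is what makes the surrounding function-call expression fail later on.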
You could try some alternatives to relying on this ordering:
- use the Or operator ('^') instead of MatchFirst ('|'); Or tries all possible alternatives and chooses the longest match (but it can recurse infinitely in recursive grammars like this one)
- force integers to be followed by a word break (using pyparsing's WordEnd() class)
HTH
EDIT
Here is an updated version, with your clarified definitions. Since you had a clear regex for your integer form, it is easiest to just use the pyparsing Regex class; and with this change, I reverted identifier to a more conventional form. I also added key-value structure to the function arguments, but had to define two varieties of function call since your argument function call does not conform to the structured argument list. And using the new pprint method makes it easier to see your arg list structure.
sample = """IO_SET(BLOCK, key1, function(1+2,3, val11), key2, val2, key3, (3U)+cVAL3);"""
from pyparsing import *
SEMI,LPAREN,RPAREN,COMMA = map(Suppress,";(),")
#identifier = Combine(Optional(Word(nums+'_')) + Word(alphas, alphanums+'_'))
identifier = Word(alphas, alphanums+'_')
#integer= Combine(Optional('-') + Word(nums))
integer = Regex(r"[+-]?\d+[Uu]?[Ll]?")
realnum = Combine(integer.copy() + '.' + Optional(Word(nums)))
fn_call1 = Forward()
fn_call2 = Forward()
# this order is *critical*
value = realnum | fn_call1 | fn_call2 | identifier | integer
expr = infixNotation(value,
[
(oneOf('* /'), 2, opAssoc.LEFT),
(oneOf('+ -'), 2, opAssoc.LEFT),
])
key_value = Group(identifier + COMMA + expr)
kv_args = identifier + Optional(COMMA + delimitedList(key_value))
fn_call1 <<= Group(identifier + LPAREN + Group(kv_args) + RPAREN)
simple_args = Optional(delimitedList(expr))
fn_call2 <<= Group(identifier + LPAREN + Group(simple_args) + RPAREN)
value.parseString(sample).pprint()
Prints:
[['IO_SET',
['BLOCK',
['key1', ['function', [['1', '+', '2'], '3', 'val11']]],
['key2', 'val2'],
['key3', ['3U', '+', 'cVAL3']]]]]
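If you want the settings as a mapping rather than nested lists, a small post-processing step is enough. This is only a sketch: it assumes exactly the list structure printed above (copied here as a literal, so it runs without pyparsing), and io_set_to_dict is a hypothetical helper name.

```python
# The structure printed above, copied as a plain Python literal
parsed = [['IO_SET',
           ['BLOCK',
            ['key1', ['function', [['1', '+', '2'], '3', 'val11']]],
            ['key2', 'val2'],
            ['key3', ['3U', '+', 'cVAL3']]]]]


def io_set_to_dict(call):
    """Turn one parsed [name, [block, [key, value], ...]] call into a dict."""
    name, args = call
    block, pairs = args[0], args[1:]
    # Each pair is a two-element [key, value] group; values stay as parsed
    return {"macro": name, "block": block,
            "settings": {key: value for key, value in pairs}}


result = io_set_to_dict(parsed[0])
print(result["block"])             # BLOCK
print(result["settings"]["key2"])  # val2
```

Function-call and expression values are left as nested lists, so evaluating or pretty-printing them remains a separate step.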
Upvotes: 2