Reputation: 2923
I'm trying to write a simplified regular expression parser (only support the *
and |
operators in addition to concatenation) using pyparsing. Here's my grammar so far:
from pyparsing import alphas, Word, Forward
regular_expression = Forward()
character = Word(alphas, max=1)
group = '(' + regular_expression + ')'
star = (character | group) + '*'
# A 'concat_expression' is any string concat of the above
# String concat with a 'concat_expression' also produces a 'concat_expression'
concat_expression = Forward()
concat_expression << ((character | group | star | concat_expression) +
(character | group | star))
or_expression = regular_expression + '|' + regular_expression
regular_expression << or_expression | concat_expression
I'm getting infinite recursion when I try and parse simple expressions (e.g. regular_expression.parseString("a")
). Is there something wrong with my definition of concatenation?
For reference, I'm trying to adapt this grammar.
Upvotes: 1
Views: 527
Reputation: 63782
The issue you are looking at at the moment (infinite recursion) is due to left-recursion in your grammar. pyparsing is purely left-to-right parsing, with no lookahead (unless you do it explicitly in your grammar). So you if define a list as:
expr ::= expr | expr
then it will just spiral down into the dirt.
I think part of this is that your parser is kind of all over the place, with many redundant recursive terms. If you can stop and think about the definition of what you are parsing, and even capture it into some simplified BNF, it should help clarify your thinking.
Here's what I got:
Or a little more formally:
re_item ::= (character | '(' re_expr ')') [repetition]
re_item_seq ::= re_item+
repetition ::= '*' | '?'
re_expr ::= re_item_seq [ '|' re_item_seq ]...
There is no left-recursion in this parser, since the re_expr can only be recursed into after matching an opening parenthesis.
Translating to pyparsing is almost verbatim:
from pyparsing import (alphas, Word, Forward, oneOf, OneOrMore, Optional,
delimitedList, Group)
character = Word(alphas, exact=1).setName("character") # <- use 'exact', not 'max'
regular_expression = Forward()
group = Group('(' + regular_expression + ')')
repetition = oneOf("* ?")
re_item = Group((character | group) + repetition) | character | group
re_item_seq = OneOrMore(re_item)
regular_expression <<= delimitedList(re_item_seq, '|')
Testing this out:
regular_expression.runTests("""
a
a?
sdlfj*(b|c)?
""")
gives:
a
['a']
a?
[['a', '?']]
[0]:
['a', '?']
sdlfj*(b|c)?
['s', 'd', 'l', 'f', ['j', '*'], [['(', 'b', 'c', ')'], '?']]
[0]:
s
[1]:
d
[2]:
l
[3]:
f
[4]:
['j', '*']
[5]:
[['(', 'b', 'c', ')'], '?']
[0]:
['(', 'b', 'c', ')']
[1]:
?
TL;DR - read up on "left recursion", also you can visit this re parsing example (which inverts an re into a list of candidate strings that the re could match): http://pyparsing.wikispaces.com/file/view/invRegex.py/111533811/invRegex.py
Upvotes: 2