rookie
rookie

Reputation: 2923

Parsing regular expressions with pyparsing

I'm trying to write a simplified regular expression parser (only support the * and | operators in addition to concatenation) using pyparsing. Here's my grammar so far:

from pyparsing import alphas, Word, Forward

regular_expression = Forward()

character = Word(alphas, max=1) 
group     = '(' + regular_expression + ')'
star      = (character | group) + '*'

# A 'concat_expression' is any string concat of the above
# String concat with a 'concat_expression' also produces a 'concat_expression'
concat_expression = Forward()
concat_expression << ((character | group | star | concat_expression) +
                      (character | group | star))

or_expression = regular_expression + '|' + regular_expression

regular_expression << or_expression | concat_expression

I'm getting infinite recursion when I try and parse simple expressions (e.g. regular_expression.parseString("a")). Is there something wrong with my definition of concatenation?

For reference, I'm trying to adapt this grammar.

Upvotes: 1

Views: 527

Answers (1)

PaulMcG
PaulMcG

Reputation: 63782

The issue you are looking at at the moment (infinite recursion) is due to left-recursion in your grammar. pyparsing is purely left-to-right parsing, with no lookahead (unless you do it explicitly in your grammar). So you if define a list as:

expr ::= expr | expr

then it will just spiral down into the dirt.

I think part of this is that your parser is kind of all over the place, with many redundant recursive terms. If you can stop and think about the definition of what you are parsing, and even capture it into some simplified BNF, it should help clarify your thinking.

Here's what I got:

  • an expression can be made up of pieces, each of which is a single character or an expression inside ()'s, optionally followed by some repetition indicator ('*' or '?' or and integer in {}'s, etc.)
  • these pieces may be run together next to each other
  • these pieces may be split with '|' to indicate alternatives

Or a little more formally:

re_item     ::= (character | '(' re_expr ')') [repetition]
re_item_seq ::= re_item+
repetition  ::= '*' | '?'
re_expr     ::= re_item_seq [ '|' re_item_seq ]...

There is no left-recursion in this parser, since the re_expr can only be recursed into after matching an opening parenthesis.

Translating to pyparsing is almost verbatim:

from pyparsing import (alphas, Word, Forward, oneOf, OneOrMore, Optional, 
    delimitedList, Group)

character = Word(alphas, exact=1).setName("character") # <- use 'exact', not 'max'

regular_expression = Forward()
group     = Group('(' + regular_expression + ')')
repetition = oneOf("* ?")

re_item = Group((character | group) + repetition) | character | group
re_item_seq = OneOrMore(re_item)

regular_expression <<= delimitedList(re_item_seq, '|')

Testing this out:

regular_expression.runTests("""
    a
    a?
    sdlfj*(b|c)?
    """)

gives:

a
['a']


a?
[['a', '?']]
[0]:
  ['a', '?']


sdlfj*(b|c)?
['s', 'd', 'l', 'f', ['j', '*'], [['(', 'b', 'c', ')'], '?']]
[0]:
  s
[1]:
  d
[2]:
  l
[3]:
  f
[4]:
  ['j', '*']
[5]:
  [['(', 'b', 'c', ')'], '?']
  [0]:
    ['(', 'b', 'c', ')']
  [1]:
    ?

TL;DR - read up on "left recursion", also you can visit this re parsing example (which inverts an re into a list of candidate strings that the re could match): http://pyparsing.wikispaces.com/file/view/invRegex.py/111533811/invRegex.py

Upvotes: 2

Related Questions