lanour

Reputation: 123

Pyparsing - Rule Ambiguity

I am writing a Pyparsing grammar to convert Creole markup to HTML. I'm stuck because there's a bit of conflict trying to parse these two constructs:

Image link: {{image.jpg|title}}
Ignore formatting: {{{text}}}

The way I'm parsing the image link is as follows (note that this converts perfectly fine):

def parse_image(s, l, t):
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s,l,"invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)

image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

Next, I wrote a rule so that when {{{text}}} is encountered, simply return what's between the opening and closing braces without formatting it:

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])

However, when I try to run the following test case:

text = italic | bold | hr | newline | image | n
print text.transformString("{{{ //ignore formatting// }}}")

I get the following stack trace:

Traceback (most recent call last):
  File "C:\Users\User\py\kreyol\parser.py", line 36, in <module>
    print text.transformString("{{{ //ignore formatting// }}}")
  File "C:\Python27\lib\site-packages\pyparsing.py", line 1210, in transformString
    raise exc
pyparsing.ParseFatalException: invalid image link reference: { //ignore formatting//  (at char 0), (line:1, col:1)

From what I understand, the parser encounters the {{ first and tries to parse the text as an image instead of text without formatting. How can I solve this ambiguity?

Upvotes: 1

Views: 323

Answers (1)

PaulMcG

Reputation: 63709

The issue is with this expression:

text = italic | bold | hr | newline | image | n

Pyparsing works strictly left-to-right, with no lookahead. The '|' operator builds a pyparsing MatchFirst expression, which returns the first alternative that matches, even if a later alternative would be a better match.

You can change the evaluation to use "longest match" by using the '^' operator instead:

text = italic ^ bold ^ hr ^ newline ^ image ^ n

This carries a performance penalty: every alternative is tested, even when there is no possibility of a better match.
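The difference between the two strategies can be sketched with a toy model. This is not pyparsing's implementation: the helper names `first_match` and `longest_match` and the two regular expressions are assumptions made up for illustration, standing in for '|' and '^' and for the image and n rules.

```python
import re

def first_match(alternatives, text):
    """Return the text matched by the first alternative that succeeds (like '|')."""
    for pattern in alternatives:
        m = re.match(pattern, text)
        if m:
            return m.group(0)
    return None

def longest_match(alternatives, text):
    """Try every alternative and keep the longest match (like '^')."""
    best = None
    for pattern in alternatives:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best

# Hypothetical stand-ins for the two Creole rules from the question.
IMAGE = r"\{\{.*?\}\}"       # {{ ... }}
NOWIKI = r"\{\{\{.*?\}\}\}"  # {{{ ... }}}

sample = "{{{ text }}}"

# With IMAGE listed first, first-match stops at the shorter, wrong match.
image_first = first_match([IMAGE, NOWIKI], sample)
# Longest-match tests both alternatives and keeps the full span.
longest = longest_match([IMAGE, NOWIKI], sample)
# Reordering gives first-match the right answer without testing IMAGE at all.
nowiki_first = first_match([NOWIKI, IMAGE], sample)
```

Note that `longest_match` always runs every pattern, which is the extra cost mentioned above, while `first_match` returns as soon as one succeeds.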

An easier solution is to just reorder the expressions in your list of alternatives: test for n before image:

text = italic | bold | hr | newline | n | image

Now when evaluating alternatives, it will look for the leading {{{ of n before the leading {{ of image.
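As a minimal sketch of the reordered grammar, assuming pyparsing is installed: the definitions of image and n are copied from the question, while italic, bold, hr and newline are left out because the question does not show them.

```python
from pyparsing import ParseFatalException, QuotedString

def parse_image(s, l, t):
    # Split "link|title"; anything else is treated as a fatal error.
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s, l, "invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)

image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)

n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda t: t[0])

# n is listed before image, so the leading {{{ is tried first.
# (The other alternatives from the question are omitted in this sketch.)
text = n | image

unformatted = text.transformString("{{{ //ignore formatting// }}}")
img_html = text.transformString("{{image.jpg|title}}")
```

Both constructs still work here because an image link starts with exactly two braces, so n fails on it and the parser falls through to image.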

This often crops up when people define numeric terms, and accidentally define something like:

integer = Word(nums)
realnumber = Combine(Word(nums) + '.' + Word(nums))
number = integer | realnumber

In this case, number will never match a realnumber, since the leading whole number part will be parsed as an integer. The fix, as in your case, is to either use the '^' operator or just reorder:

number = realnumber | integer
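A quick way to check the three variants side by side, assuming pyparsing is installed; the variable names `number_wrong`, `number_reordered` and `number_longest` are made up for this sketch.

```python
from pyparsing import Combine, Word, nums

integer = Word(nums)
realnumber = Combine(Word(nums) + "." + Word(nums))

number_wrong = integer | realnumber      # integer wins before realnumber is tried
number_reordered = realnumber | integer  # the more specific rule goes first
number_longest = integer ^ realnumber    # longest match, order does not matter

# parseString only needs to match a prefix of the input by default.
wrong = number_wrong.parseString("3.14")[0]
reordered = number_reordered.parseString("3.14")[0]
longest = number_longest.parseString("3.14")[0]
whole = number_reordered.parseString("42")[0]
```

With the wrong ordering only the leading "3" is consumed, while the other two variants return the full "3.14"; plain integers such as "42" still parse after reordering because realnumber fails and the parser falls back to integer.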

Upvotes: 3
