john-cd

Reputation: 55

pyparsing: when ignoring comments, parseAll=True does not throw a ParseException

I noticed a weird side-effect in pyparsing:

When I call .ignore() on a parser that contains another parser, calling parseString(..., parseAll=True) on the inner parser stops examining the string at the comment symbol instead of raising an exception. The code below shows it better.

How do I fix that without using stringEnd?

example:

def test():        
    import pyparsing as p
    unquoted_exclude = "\\\"" + "':/|<>,;#"   
    unquoted_chars = ''.join(set(p.printables) - set(unquoted_exclude))
    unquotedkey = p.Word(unquoted_chars)

    more = p.OneOrMore(unquotedkey)
    more.ignore("#" + p.restOfLine) 
    # ^^ "more" should ignore comments, but not "unquotedkey" !!

    def parse(parser, input_, parseAll=True):
        try:
            print(input_)
            print(parser.parseString(input_, parseAll).asList())
        except Exception as err:
            print(err)


    parse(unquotedkey, "abc#d")
    parse(unquotedkey, "abc|d")

    withstringend = unquotedkey + p.stringEnd 
    parse(withstringend, "abc#d", False)
    parse(withstringend, "abc|d", False)

Output:

abc#d     
['abc'] <--- should throw an exception but does not  
abc|d
Expected end of text (at char 3), (line:1, col:4)
abc#d
Expected stringEnd (at char 3), (line:1, col:4)
abc|d
Expected stringEnd (at char 3), (line:1, col:4)

Upvotes: 4

Views: 2773

Answers (1)

PaulMcG

Reputation: 63709

To compare apples to apples, you should also add this line after defining withstringend:

withstringend.ignore('#' + p.restOfLine)

I think you'll see it has the same behavior as your test of parsing with unquotedKey.

The purpose of ignore is to ignore a construct anywhere within a parsed input text, not just at the topmost level. For example, in a C program, you don't just ignore comments between statements:

/* add one to x */
x ++;

You also have to ignore comments that might appear anywhere:

x /* this is a post-increment 
so it really won't add 1 to x until after the
statement executes */ ++
/* and this is the trailing semicolon 
for the previous statement -> */;

Or perhaps a little less contrived:

for (x = ptr; /* start at ptr */
     *x; /* keep going as long as we point to non-zero */
     x++ /* add one to x */ )

To support this, ignore() recurses through the entire parser and adds the ignorable expression to every sub-parser, so that ignorables are skipped at every level of the overall parser. The alternative would be to sprinkle calls to ignore() all over your parser definition and constantly chase down the ones you missed.
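A minimal sketch of that propagation, assuming a pyparsing version that still accepts the camelCase names used in the question. The names key, keys and accepts are only illustrative; they are not from the original code.

```python
import pyparsing as p

# Allowed key characters: printables minus a few special ones, as in the question.
key_chars = ''.join(sorted(set(p.printables) - set("\\\"':/|<>,;#")))
key = p.Word(key_chars)


def accepts(parser, text):
    """Return True if parser consumes all of text with parseAll=True."""
    try:
        parser.parseString(text, parseAll=True)
        return True
    except p.ParseException:
        return False


# Before any ignore() call, the trailing '#d' is leftover input.
before = accepts(key, "abc#d")

# ignore() on the enclosing parser is also applied to the contained `key`.
keys = p.OneOrMore(key)
keys.ignore("#" + p.restOfLine)

# Now `key` itself skips '#d' as a comment, so parseAll succeeds.
after = accepts(key, "abc#d")

print(before, after)
```

The only thing that changes between the two checks is the ignore() call on the enclosing parser.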

So in your first case, when you did:

more = p.OneOrMore(unquotedKey)
more.ignore('#' + p.restOfLine)

you also updated the ignorables for unquotedKey. If you want to isolate unquotedKey so that it does not get this side-effect, then define more using:

more = p.OneOrMore(unquotedKey.copy())
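
Here is a small sketch of the copy() approach, under the same assumptions about the pyparsing version. The helper parse_all and the names key and keys are hypothetical and used only for illustration.

```python
import pyparsing as p

# Allowed key characters: printables minus a few special ones, as in the question.
key_chars = ''.join(sorted(set(p.printables) - set("\\\"':/|<>,;#")))
key = p.Word(key_chars)

# Wrap a copy, so ignore() updates the copy and leaves `key` untouched.
keys = p.OneOrMore(key.copy())
keys.ignore("#" + p.restOfLine)


def parse_all(parser, text):
    """Return the token list, or None if the whole text cannot be parsed."""
    try:
        return parser.parseString(text, parseAll=True).asList()
    except p.ParseException:
        return None


standalone = parse_all(key, "abc#d")                      # comment is not ignored here
with_comments = parse_all(keys, "abc def # trailing comment")

print(standalone)
print(with_comments)
```

With the copy in place, key on its own rejects "abc#d" again, while keys still skips comments.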

One other point: you define an unquoted key as "everything in printables except for these special characters". The technique you use was the right one up until version 1.5.6, when the excludeChars argument was added to the Word class. Now you don't have to build the list of allowed characters yourself; you can have Word do the work. Try:

unquotedKey = p.Word(p.printables,
                     excludeChars = r'\"' + "':/|<>,;#")
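
As a quick check that the two definitions behave the same, here is a sketch that compares them on a few inputs. It assumes a pyparsing version (1.5.6 or later) that accepts the excludeChars keyword; the names by_hand, by_arg and the sample strings are made up for illustration.

```python
import pyparsing as p

exclude = r'\"' + "':/|<>,;#"

# Hand-built character list, as in the question.
by_hand = p.Word(''.join(sorted(set(p.printables) - set(exclude))))

# Same intent, letting Word do the exclusion.
by_arg = p.Word(p.printables, excludeChars=exclude)

samples = ["abc|d", "key-1:value", "a_b.c"]
results = [
    (by_hand.parseString(s).asList(), by_arg.parseString(s).asList())
    for s in samples
]

for sample, (hand, arg) in zip(samples, results):
    print(sample, hand, arg)
```

Both parsers stop at the first excluded character, so each pair of results should match.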

Upvotes: 4
