Reputation: 55
I noticed a weird side-effect in pyparsing:
When I call .ignore() on a parser built from a smaller parser, parseString(..., parseAll=True) on the smaller parser stops examining the whole string at the comment symbol. The code below explains it better.
How do I fix that without using stringEnd?
example:
def test():
    import pyparsing as p
    unquoted_exclude = "\\\"" + "':/|<>,;#"
    unquoted_chars = ''.join(set(p.printables) - set(unquoted_exclude))
    unquotedkey = p.Word(unquoted_chars)

    more = p.OneOrMore(unquotedkey)
    more.ignore("#" + p.restOfLine)
    # ^^ "more" should ignore comments, but not "unquotedkey" !!

    def parse(parser, input_, parseAll=True):
        # Print the input, then either the parsed tokens or the error.
        try:
            print(input_)
            print(parser.parseString(input_, parseAll).asList())
        except Exception as err:
            print(err)

    parse(unquotedkey, "abc#d")
    parse(unquotedkey, "abc|d")

    withstringend = unquotedkey + p.stringEnd
    parse(withstringend, "abc#d", False)
    parse(withstringend, "abc|d", False)
Output:
abc#d
['abc'] <--- should throw an exception but does not
abc|d
Expected end of text (at char 3), (line:1, col:4)
abc#d
Expected stringEnd (at char 3), (line:1, col:4)
abc|d
Expected stringEnd (at char 3), (line:1, col:4)
Upvotes: 4
Views: 2773
Reputation: 63709
To compare apples to apples, you should also add this line after defining withstringend:
withstringend.ignore('#' + p.restOfLine)
I think you'll see it has the same behavior as your test of parsing with unquotedKey.
The purpose of ignore is to ignore a construct anywhere within a parsed input text, not just at the topmost level. For example, in a C program, you don't just ignore comments between statements:
/* add one to x */
x ++;
You also have to ignore comments that might appear anywhere:
x /* this is a post-increment
so it really won't add 1 to x until after the
statement executes */ ++
/* and this is the trailing semicolon
for the previous statement -> */;
Or perhaps a little less contrived:
for (x = ptr; /* start at ptr */
*x; /* keep going as long as we point to non-zero */
x++ /* add one to x */ )
So to support this, ignore() is implemented to recurse through the entire defined parser and update the list of ignorable expressions on every sub-parser in the overall parser, so that ignorables are skipped over at every level. The alternative would be to sprinkle calls to ignore all over your parser definition, and constantly try to chase down those that were accidentally skipped over.
So in your first case, when you did:
more = p.OneOrMore(unquotedKey)
more.ignore('#' + p.restOfLine)
you also updated the ignorables for unquotedKey. If you want to isolate unquotedKey so that it does not get this side-effect, then define more using:
more = p.OneOrMore(unquotedKey.copy())
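Here is a minimal sketch of the difference, assuming a simplified key made only of letters (not your real character set). The helper accepts() and the names shared_key, isolated_key and so on are made up for this illustration:

```python
import pyparsing as pp

# Comment expression: "#" followed by the rest of the line.
comment = "#" + pp.restOfLine


def accepts(parser, text):
    """Return True if parser matches the whole of text (parseAll=True)."""
    try:
        parser.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False


# Case 1: the sub-expression is shared, so ignore() on the container
# also registers the comment as an ignorable on shared_key itself.
shared_key = pp.Word(pp.alphas)
shared_more = pp.OneOrMore(shared_key)
shared_more.ignore(comment)

# Case 2: the container wraps a copy, so ignore() only touches the copy
# and isolated_key keeps an empty list of ignorables.
isolated_key = pp.Word(pp.alphas)
isolated_more = pp.OneOrMore(isolated_key.copy())
isolated_more.ignore(comment)

shared_result = accepts(shared_key, "abc#d")      # comment is skipped
isolated_result = accepts(isolated_key, "abc#d")  # "#d" is left unparsed
print(shared_result, isolated_result)
```

With the copy in place, isolated_more still skips comments between keys, while isolated_key on its own rejects "abc#d" when parsing with parseAll=True.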
One other point: you define an unquoted key as "everything in printables except for these special characters". The technique you use was good up until version 1.5.6, when the excludeChars argument was added to the Word class. Now you don't have to mess around building the list of only the allowed characters; you can have Word do the work. Try:
unquotedKey = p.Word(p.printables,
                     excludeChars=r'\"' + "':/|<>,;#")
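As a quick check that the two definitions should behave alike, here is a sketch that builds the key both ways and parses a few sample strings. The sample inputs and the names manual_key and exclude_key are only for illustration:

```python
import pyparsing as pp

# Characters that may not appear in an unquoted key.
excluded = r'\"' + "':/|<>,;#"

# Old approach: build the allowed character set by hand.
manual_chars = ''.join(sorted(set(pp.printables) - set(excluded)))
manual_key = pp.Word(manual_chars)

# Newer approach: let Word remove the excluded characters itself.
exclude_key = pp.Word(pp.printables, excludeChars=excluded)

samples = ["abc#d", "key|value", "plain_key-1"]
manual_tokens = [manual_key.parseString(s).asList() for s in samples]
exclude_tokens = [exclude_key.parseString(s).asList() for s in samples]
print(manual_tokens == exclude_tokens)
```

Both parsers stop at the first excluded character, so each sample yields the same leading token either way.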
Upvotes: 4