Pyparsing: dblQuotedString parsing differently in nestedExpr

Question

I'm working on a grammar to parse search queries (not evaluate them, just break them into component pieces). Right now I'm working with nestedExpr, just to grab the different 'levels' of each term, but I seem to have an issue if the first part of a term is in double quotes.

A simple version of the grammar:

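from pyparsing import (Combine, OneOrMore, QuotedString, Word, alphas8bit,
                       dblQuotedString, nestedExpr, printables)

# remove_curlies is a parse action defined elsewhere (not shown here)
QUOTED = QuotedString(quoteChar = '“', endQuoteChar = '”', unquoteResults = False).setParseAction(remove_curlies)
WWORD = Word(alphas8bit + printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | QUOTED | WWORD), joinString = ' ', adjacent = False)
TERM = OneOrMore(WORDS)
NESTED = OneOrMore(nestedExpr(content = TERM))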

query = '(dog* OR boy girl w/3 ("girls n dolls" OR friends OR "best friend" OR (friends w/10 enemies)))'

Calling NESTED.parseString(query) returns:

[['dog* OR boy girl w/3', ['"girls n dolls"', 'OR friends OR "best friend" OR', ['friends w/10 enemies']]]]

The first dblQuotedString match is split off from the rest of the term at the same nesting level. This doesn't happen with the second dblQuotedString match, and it also doesn't happen if the quoted part is a QUOTED match (with curly quotes) rather than a dblQuotedString with straight ones.

Is there something special about dblQuotedString that I'm missing?

NOTE: I know that operatorPrecedence can break up search terms like this, but I have some limits on what can be broken apart, so I'm testing if I can use nestedExpr to work within those limits.

PaulMcG · Accepted Answer

nestedExpr takes an optional keyword argument, ignoreExpr: an expression that nestedExpr uses to skip over text that would otherwise be interpreted as nesting openers or closers. The default is pyparsing's quotedString, which is defined as sglQuotedString | dblQuotedString. This is to handle strings like:

(this has a tricky string "string with )" )

Since the default ignoreExpr is quotedString, the ')' in quotes is not misinterpreted as the closing parenthesis.
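
To see that default in action, here is a minimal sketch that parses the tricky string above with nestedExpr's default arguments. It assumes pyparsing is installed and uses the same camelCase names as the question; the variable names are only for illustration.

```python
from pyparsing import nestedExpr

# Default arguments: "(" and ")" as delimiters, quotedString as ignoreExpr.
parser = nestedExpr()

tricky = '(this has a tricky string "string with )" )'
result = parser.parseString(tricky).asList()

# The quoted string should come back as a single token, ")" included,
# so only the final ")" closes the group.
print(result)
```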

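However, your content expression also matches dblQuotedString. nestedExpr tries its ignore expression before your content, so the leading quoted string is consumed by nestedExpr itself (its way of skipping over quoted strings that may contain "()"s) and comes back as a separate token. Only then does your content expression get a turn, and it picks up the rest of the term, including any later quoted strings. Curly-quoted strings are unaffected because quotedString only matches straight quotes. You can suppress nestedExpr's ignore expression using a NoMatch: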

NESTED = OneOrMore(nestedExpr(content = TERM, ignoreExpr=NoMatch()))

which should now give you:

[['dog* OR boy girl w/3',
 ['"girls n dolls" OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
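
For a side-by-side check, here is a runnable sketch of both variants. It is a simplified version of the grammar from the question: the curly-quote QUOTED expression and alphas8bit are left out because the sample query does not need them, and build_parser is a hypothetical helper name, not part of pyparsing.

```python
from pyparsing import (Combine, NoMatch, OneOrMore, Word, dblQuotedString,
                       nestedExpr, printables)

# Simplified grammar from the question (no curly-quote handling).
WWORD = Word(printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | WWORD), joinString=" ", adjacent=False)
TERM = OneOrMore(WORDS)


def build_parser(**nested_kwargs):
    """Hypothetical helper: wrap TERM in nestedExpr with extra keyword args."""
    return OneOrMore(nestedExpr(content=TERM, **nested_kwargs))


query = ('(dog* OR boy girl w/3 ("girls n dolls" OR friends OR '
         '"best friend" OR (friends w/10 enemies)))')

# Default ignoreExpr (quotedString): the leading quoted string is split off.
default_result = build_parser().parseString(query).asList()

# ignoreExpr=NoMatch(): quoted strings are left entirely to TERM.
fixed_result = build_parser(ignoreExpr=NoMatch()).parseString(query).asList()

print(default_result)
print(fixed_result)
```

Note that with NoMatch as the ignore expression, a parenthesis inside a double-quoted string is still handled in this grammar only because TERM itself matches dblQuotedString. With a content expression that does not match quoted strings, that protection would be lost.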

You'll find more details and examples at https://pythonhosted.org/pyparsing/pyparsing-module.html#nestedExpr
