Reputation: 33
I'm working on a grammar to parse search queries (not evaluate them, just break them into component pieces). Right now I'm working with nestedExpr, just to grab the different 'levels' of each term, but I seem to have an issue if the first part of a term is in double quotes.
A simple version of the grammar:
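from pyparsing import (Combine, OneOrMore, QuotedString, Word, alphas8bit,
                       dblQuotedString, nestedExpr, printables)

# remove_curlies is a custom parse action; its definition is not shown here
QUOTED = QuotedString(quoteChar = '“', endQuoteChar = '”', unquoteResults = False).setParseAction(remove_curlies)
WWORD = Word(alphas8bit + printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | QUOTED | WWORD), joinString = ' ', adjacent = False)
TERM = OneOrMore(WORDS)
NESTED = OneOrMore(nestedExpr(content = TERM))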
query = '(dog* OR boy girl w/3 ("girls n dolls" OR friends OR "best friend" OR (friends w/10 enemies)))'
Calling NESTED.parseString(query) returns:
[['dog* OR boy girl w/3', ['"girls n dolls"', 'OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
The first dblQuotedString instance is split off from the rest of the term at the same nesting level. This doesn't happen to the second dblQuotedString instance, and it also doesn't happen if the quoted bit is a QUOTED instance (with curly quotes) rather than a dblQuotedString with straight ones.
Is there something special about dblQuotedString that I'm missing?
NOTE: I know that operatorPrecedence can break up search terms like this, but I have some limits on what can be broken apart, so I'm testing if I can use nestedExpr to work within those limits.
Upvotes: 3
Views: 476
Reputation: 63762
nestedExpr takes an optional keyword argument ignoreExpr: an expression that nestedExpr uses to skip over characters that would otherwise be interpreted as nesting openers or closers. The default is pyparsing's quotedString, which is defined as sglQuotedString | dblQuotedString. This is to handle strings like:
(this has a tricky string "string with )" )
Since the default ignoreExpr is quotedString, the ')' in quotes is not misinterpreted as the closing parenthesis.
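As a small sketch of that default behaviour (assuming a pyparsing version that still accepts the camelCase names used in this thread), the tricky string can be parsed with and without the ignore expression. The variable names here are only illustrative:

```python
from pyparsing import NoMatch, nestedExpr

text = '(this has a tricky string "string with )" )'

# Default: quotedString is the ignore expression, so the quoted ')' is skipped
# over and the quoted string comes back as a single token.
default_result = nestedExpr().parseString(text).asList()
print(default_result)
# → [['this', 'has', 'a', 'tricky', 'string', '"string with )"']]

# With the ignore expression disabled, the ')' inside the quotes closes the
# group early, so parsing stops before the real closing parenthesis.
no_ignore_result = nestedExpr(ignoreExpr=NoMatch()).parseString(text).asList()
print(no_ignore_result)
```

The second call does not raise, because parseString does not require the whole input to be consumed unless parseAll=True is passed.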
However, your content argument also matches on dblQuotedString. Inside nestedExpr, the ignore expression is tried before your content, so the leading quoted string is consumed by the ignore expression (whose job is to skip over quoted strings that may contain "()"s) and comes back as its own token. Only then is your content matched, and it picks up the later quoted strings along with the surrounding words. You can suppress nestedExpr's ignore expression using a NoMatch:
NESTED = OneOrMore(nestedExpr(content = TERM, ignoreExpr=NoMatch()))
which should now give you:
[['dog* OR boy girl w/3',
['"girls n dolls" OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
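To see the two behaviours side by side, here is a self-contained sketch using a cut-down version of the grammar from the question. It assumes the curly-quote QUOTED expression and its remove_curlies parse action can be left out, and it uses a shorter query; the names with_default and with_nomatch are only illustrative:

```python
from pyparsing import (Combine, NoMatch, OneOrMore, Word,
                       dblQuotedString, nestedExpr, printables)

# Simplified grammar from the question (no curly-quote handling).
WWORD = Word(printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | WWORD), joinString=" ", adjacent=False)
TERM = OneOrMore(WORDS)

query = '("girls n dolls" OR friends OR (friends w/10 enemies))'

# Default ignoreExpr (quotedString): the leading quoted string is taken by the
# ignore expression and comes back as a separate token.
with_default = nestedExpr(content=TERM).parseString(query).asList()
print(with_default)
# → [['"girls n dolls"', 'OR friends OR', ['friends w/10 enemies']]]

# ignoreExpr=NoMatch(): only TERM consumes text, so the quoted string stays
# joined with the words that follow it.
with_nomatch = nestedExpr(content=TERM, ignoreExpr=NoMatch()).parseString(query).asList()
print(with_nomatch)
# → [['"girls n dolls" OR friends OR', ['friends w/10 enemies']]]
```

Note that with NoMatch the content expression becomes solely responsible for quoted strings, so a quoted string containing a parenthesis is only handled correctly because TERM itself tries dblQuotedString before WWORD.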
You'll find more details and examples at https://pythonhosted.org/pyparsing/pyparsing-module.html#nestedExpr
Upvotes: 3