My inputs look like key: "a word" or like anotherkey: "a word (1234)". My issue is with the grammar I have written below:
import pyparsing as pp

word = pp.Word(pp.printables, excludeChars=":")
word = ("[" + pp.Word(pp.printables + " ", excludeChars=":[]") + "]") | word
non_tag = word + ~pp.FollowedBy(":")
# tagged value is two words with a ":"
tag = pp.Group(word + ":" + word)
# one or more non-tag words - use originalTextFor to get back
# a single string, including intervening white space
phrase = pp.originalTextFor(non_tag[1, ...])
parser = (phrase | tag)[...]
When my input is like key: "value1" and hey you how are you?
it is parsed into the expected output, which is ([(['key', ':', '"value1"'], {}), 'and hey you how are you?'], {}),
but the problem occurs when the value after the key contains spaces:
parser.parseString('key: "Microsoft windows (12932)" and hey you how are you?')
([(['key', ':', '"Microsoft'], {}), 'windows (12932)" and hey you how are you?'], {})
It breaks between Microsoft and windows. I know pyparsing ignores whitespace between tokens, but how can I solve this issue and capture everything up to the end of the phrase, which is the closing double quote?
EDIT-1 I tried to work around this problem by adding another word like below:
word = ('"' + pp.Word(pp.printables + " ", excludeChars=':"') + '"') | word
It works on queries like key: "windows server (23232)" but not on more complex queries like key1: value and key2: "windows server (1212)". Does anyone have a clue about this issue and how I can work around this behavior?
EDIT-2 What do I expect? I need to extend my grammar to handle a query like the one below:
key: "Microsoft windows (12932)" and hey you how are you?
It should NOT be:
([(['key', ':', '"Microsoft'], {}), 'windows (12932)" and hey you how are you?'], {})
It should be:
([(['key', ':', '"Microsoft windows (12932)"'], {}), 'and hey you how are you?'], {})
This query can be combined with more keys and a free-text search, like below:
A free text search and key1: "Microsoft windows (12312)" and key2: "Sample2" or key3: "Another sample (121212)"
This should also get parsed like below:
part1: A free text search and
part2: ['key1', ':', '"Microsoft windows (12312)"']
part3: ['key2', ':', '"Sample2"']
part4: ['key3', ':', '"Another sample (121212)"']
NOTE: if "and" or "or" stays attached to the tokens, that is OK for me. I just need to separate the free-text search from the key:value queries.
Reputation: 63719
I generally discourage people from writing Words that include spaces as valid word characters.
Doing so disables most lookahead rules and keyword matching. That is why "and" and "or" get included
in your search term even though they probably should be logical operators.
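To make that concrete, here is a small sketch (the names greedy and plain are just for illustration) comparing a Word that allows spaces with one that does not, on a fragment of the query from the question:

```python
import pyparsing as pp

# A Word whose character set includes " " treats the space as part of the
# word, so it keeps consuming until it reaches an excluded character.
greedy = pp.Word(pp.printables + " ", excludeChars=':"')
greedy_tokens = greedy.parseString("windows server and key2").asList()
print(greedy_tokens)

# Without the space, each word stops at whitespace, so "and" is a separate
# token that a keyword expression would get a chance to match.
plain = pp.Word(pp.printables, excludeChars=':"')
plain_tokens = plain[1, ...].parseString("windows server and key2").asList()
print(plain_tokens)
```

The first parse yields one long token with "and" buried inside it; the second yields four separate words.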
If this is supposed to be a search string, then start by writing a BNF for doing search:
word := group of any non-whitespace characters, excluding '":[]'
non_tag := word ~":"
tagged_value := word ':' (quoted_string | word)
phrase := non_tag...
search_term := quoted_string | tagged_value | phrase | '[' search_expr ']'
search_expr_not := NOT? search_term
search_expr_and := search_expr_not ['and' search_expr_not]...
search_expr_or := search_expr_and ['or' search_expr_and]...
search_expr := search_expr_or
This reuses several of the expressions just as you defined them. You were definitely
on the right track with some of your expressions like non_tag and phrase. Where things
went bad was when you tried to handle quoted strings by just extending your word
expression.
We also need to define word in such a way that it won't match any of the operator keywords "and", "or", or "not". So we start by creating expressions for them:
import pyparsing as pp

AND, OR, NOT = map(pp.CaselessKeyword, "and or not".split())
any_keyword = AND | OR | NOT
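As a quick check of why a Keyword class is used here rather than a plain literal, the sketch below (starts_with_keyword is a hypothetical helper, not part of pyparsing) shows that the keyword expressions match regardless of case but do not match the leading characters of a longer word:

```python
import pyparsing as pp

AND, OR, NOT = map(pp.CaselessKeyword, "and or not".split())
any_keyword = AND | OR | NOT


def starts_with_keyword(text):
    """Return True if text begins with one of the operator keywords."""
    try:
        any_keyword.parseString(text)
        return True
    except pp.ParseException:
        return False


print(starts_with_keyword("AND key2"))  # keyword in upper case
print(starts_with_keyword("android"))   # "and" is only a prefix here
print(starts_with_keyword("order"))     # "or" is only a prefix here
```

So words like "android" or "order" remain usable as ordinary search words.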
We will also define an expression to handle quoted strings specifically
(instead of adding " " and '"' to word):
quoted_string = pp.QuotedString('"')
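Note that QuotedString strips the enclosing quotes by default. If you want to keep them, as in the expected output in the question, the unquoteResults argument should do that. This is a sketch assuming the camelCase argument name is accepted by your installed pyparsing version; the names stripped and kept are just for illustration:

```python
import pyparsing as pp

text = '"Microsoft windows (12932)" and more'

# Default behavior: the surrounding quotes are removed from the token.
stripped = pp.QuotedString('"')
stripped_tokens = stripped.parseString(text).asList()
print(stripped_tokens)

# unquoteResults=False keeps the quotes as part of the token.
kept = pp.QuotedString('"', unquoteResults=False)
kept_tokens = kept.parseString(text).asList()
print(kept_tokens)
```

Either way, the spaces inside the quotes stay in a single token, which is the part that the Word-based approach could not do.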
Here is the first part of the BNF translated to a pyparsing parser:
COLON = pp.Suppress(":")
word = pp.Combine(~any_keyword + pp.Word(pp.printables, excludeChars=':"\'[]'))
non_tag = word + ~pp.FollowedBy(":")
phrase = pp.originalTextFor(non_tag[1, ...])
# tagged value is a word followed by a ":" and a quoted string or phrase
tagged_value = pp.Group(word + COLON + (quoted_string | phrase))
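Before adding the operators, it can help to try these pieces on their own. The following self-contained sketch repeats the definitions above and parses two small inputs (the sample strings are made up for illustration):

```python
import pyparsing as pp

AND, OR, NOT = map(pp.CaselessKeyword, "and or not".split())
any_keyword = AND | OR | NOT
quoted_string = pp.QuotedString('"')
COLON = pp.Suppress(":")

word = pp.Combine(~any_keyword + pp.Word(pp.printables, excludeChars=':"\'[]'))
non_tag = word + ~pp.FollowedBy(":")
phrase = pp.originalTextFor(non_tag[1, ...])
tagged_value = pp.Group(word + COLON + (quoted_string | phrase))

# A key followed by a quoted value containing spaces becomes one group.
tag_tokens = tagged_value.parseString('key1: "windows server (1212)"').asList()
print(tag_tokens)

# A phrase stops before an operator keyword, so "and" is left unparsed here.
phrase_tokens = phrase.parseString("hey you and key2: x").asList()
print(phrase_tokens)
```

The phrase stops at "and" because word refuses to match a keyword, which is what later lets the operator handling see it.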
Then, to tie things together using "and", "or", and "not" as operators (the last part of the BNF), we use
pyparsing's infixNotation function. It looks like you want to use "[]"s as grouping
characters, so we can specify them as overrides to the default "()" grouping characters.
We start by defining what a search term looks like, using the expressions from the BNF:
search_term = quoted_string | tagged_value | phrase
Then use infixNotation to define what a search expression looks like using that term:
search_expr = pp.infixNotation(
    search_term,
    [
        (NOT, 1, pp.opAssoc.RIGHT),
        (AND, 2, pp.opAssoc.LEFT),
        (OR, 2, pp.opAssoc.LEFT),
    ],
    lpar="[", rpar="]",
)
Using search_expr as your parser, here are the results from parsing your test strings:
parser = search_expr
tests = """\
A free text search and key1: "Microsoft windows (12312)" and key2: "Sample2" or key3: "Another sample (121212)"
key: "Microsoft windows (12932)" and hey you how are you?
"""
parser.runTests(tests)
Prints:
A free text search and key1: "Microsoft windows (12312)" and key2: "Sample2" or key3: "Another sample (121212)"
[[['A free text search', 'and', ['key1', 'Microsoft windows (12312)'], 'and', ['key2', 'Sample2']], 'or', ['key3', 'Another sample (121212)']]]
[0]:
[['A free text search', 'and', ['key1', 'Microsoft windows (12312)'], 'and', ['key2', 'Sample2']], 'or', ['key3', 'Another sample (121212)']]
[0]:
['A free text search', 'and', ['key1', 'Microsoft windows (12312)'], 'and', ['key2', 'Sample2']]
[0]:
A free text search
[1]:
and
[2]:
['key1', 'Microsoft windows (12312)']
[3]:
and
[4]:
['key2', 'Sample2']
[1]:
or
[2]:
['key3', 'Another sample (121212)']
key: "Microsoft windows (12932)" and hey you how are you?
[[['key', 'Microsoft windows (12932)'], 'and', 'hey you how are you?']]
[0]:
[['key', 'Microsoft windows (12932)'], 'and', 'hey you how are you?']
[0]:
['key', 'Microsoft windows (12932)']
[1]:
and
[2]:
hey you how are you?
To actually evaluate these parsed results, please consult the simpleBool.py example in the pyparsing examples directory.
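If all you need is to separate the free text from the key:value pairs, as the question says, a short recursive walk over the nested list may be enough. This is only a sketch: split_terms and OPERATORS are hypothetical names, and it assumes the result was converted with asList() and that every two-string list not starting with an operator is a [key, value] group, as in the output above.

```python
OPERATORS = {"and", "or", "not"}


def split_terms(node, free_text=None, tags=None):
    """Collect free-text phrases and (key, value) pairs from a nested list."""
    if free_text is None:
        free_text, tags = [], []
    if isinstance(node, str):
        # Bare strings are either operators or free-text phrases.
        if node not in OPERATORS:
            free_text.append(node)
    elif (len(node) == 2
          and all(isinstance(item, str) for item in node)
          and node[0] not in OPERATORS):
        # A two-string list that is not a "not" operation is a tagged value.
        tags.append((node[0], node[1]))
    else:
        # Anything else is an operator group; descend into its children.
        for child in node:
            split_terms(child, free_text, tags)
    return free_text, tags


# Nested list copied from the second test result printed above.
parsed = [[['key', 'Microsoft windows (12932)'], 'and', 'hey you how are you?']]
free_text, tags = split_terms(parsed)
print(free_text)
print(tags)
```

This throws away the and/or/not structure, so it only fits the case where the operators do not matter for how you use the pieces.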