Reputation: 11440
I want to match an expression that looks like this:
(<some value with spaces and m$1124any crazy signs> (<more values>) <even more>)
I simply want to split those values along the round brackets (). So far, I have managed to cut down the overhead of the pyparsing s-expression example, which is far too extensive and hard to understand (IMHO).
I got as far as using nestedExpr, which reduces it to one line:
import pyparsing as pp
example = "(<some value with spaces and m$1124any crazy signs> (<more values>) <even more>)"
parser = pp.nestedExpr(opener='(', closer=')')
print(parser.parseString(example, parseAll=True).asList())
However, the result is also split at whitespace, which I do not want:
skewed_output = [['<some',
'value',
'with',
'spaces',
'and',
'm$1124any',
'crazy',
'signs>',
['<more', 'values>'],
'<even',
'more>']]
expected_output = [['<some value with spaces and m$1124any crazy signs>',
['<more values>'], '<even more>']]
best_output = [['some value with spaces and m$1124any crazy signs',
['more values'], 'even more']]
Optionally, I'd gladly take any pointers to an understandable introduction on how to build a more detailed parser. I'd like to extract the values between the < > brackets and match them (see best_output), but I can always string.strip() them afterwards.
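For reference, the strip-afterwards fallback could be sketched roughly as below. The helper name strip_angle_brackets is made up for illustration, and the sketch assumes the parser already returns whole values such as '<more values>' in a nested list. Note that str.strip("<>") removes every leading and trailing < or >, so a value that itself begins or ends with one of those characters would lose it.

```python
def strip_angle_brackets(items):
    """Recursively strip leading/trailing '<' and '>' from each string in a nested list.

    Hypothetical helper, not part of pyparsing.
    """
    return [
        strip_angle_brackets(item) if isinstance(item, list) else item.strip("<>")
        for item in items
    ]


expected_output = [['<some value with spaces and m$1124any crazy signs>',
                    ['<more values>'], '<even more>']]
best_output = strip_angle_brackets(expected_output)
print(best_output)
```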
Thanks in advance!
Upvotes: 5
Views: 3175
Reputation: 9238
Pyparsing's nestedExpr accepts content and ignoreExpr arguments, which specify what counts as a "single item" of an s-expr. You can pass a QuotedString here. Unfortunately, I did not understand the difference between the two parameters from the docs well enough, but some experiments showed me that the following code should satisfy your requirements:
import pyparsing as pp
single_value = pp.QuotedString(quoteChar="<", endQuoteChar=">")
parser = pp.nestedExpr(opener="(", closer=")",
content=single_value,
ignoreExpr=None)
example = "(<some value with spaces and m$1124any crazy signs> (<more values>) <even more>)"
print(parser.parseString(example, parseAll=True))
Output:
[['some value with spaces and m$1124any crazy signs', ['more values'], 'even more']]
It expects the list to start with (, end with ), and contain optionally whitespace-separated lists or quoted strings; each quoted string must start with <, end with >, and not contain < inside.
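The same grammar can also be spelled out by hand, which may make it easier to see what is going on. This is only a sketch of an alternative, not what nestedExpr does internally: it uses pp.Forward for the recursion, pp.Group to keep each bracketed list as its own sub-list, and pp.Suppress to drop the brackets from the result. The names lparen, rparen and sexpr are my own.

```python
import pyparsing as pp

# A single item: everything between < and >, with the delimiters removed.
single_value = pp.QuotedString(quoteChar="<", endQuoteChar=">")

# Match the round brackets but keep them out of the results.
lparen = pp.Suppress("(")
rparen = pp.Suppress(")")

# A list is "(" followed by any number of items or nested lists, then ")".
# Forward lets the expression refer to itself before it is fully defined.
sexpr = pp.Forward()
sexpr <<= pp.Group(lparen + pp.ZeroOrMore(single_value | sexpr) + rparen)

example = "(<some value with spaces and m$1124any crazy signs> (<more values>) <even more>)"
result = sexpr.parseString(example, parseAll=True).asList()
print(result)
```

Whitespace between items is skipped here because pyparsing expressions skip leading whitespace by default.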
You can play around with the content and ignoreExpr parameters further to find out that content=None, ignoreExpr=single_value makes the parser accept both quoted and unquoted strings (and split unquoted strings at spaces):
import pyparsing as pp
single_value = pp.QuotedString(quoteChar="<", endQuoteChar=">")
parser = pp.nestedExpr(opener="(", closer=")", ignoreExpr=single_value, content=None)
example = "(<some value with spaces and m$1124any crazy signs> (<more values>) <even m<<ore> foo (foo) <(foo)>)"
print(parser.parseString(example, parseAll=True))
Output:
[['some value with spaces and m$1124any crazy signs', ['more values'], 'even m<<ore', 'foo', ['foo'], '(foo)']]
Some questions left open:
Why does pyparsing ignore whitespace between consecutive list items?
What is the difference between content and ignoreExpr, and when should one use each of them?
Upvotes: 7