Reputation: 5686
I'm trying to split a string like aaa:bbb(123) into tokens using Pyparsing.
I can do this with a regular expression, but I need to do it with Pyparsing.
With re, the solution looks like this:
>>> import re
>>> string = 'aaa:bbb(123)'
>>> regex = '(\S+):(\S+)\((\d+)\)'
>>> re.match(regex, string).groups()
('aaa', 'bbb', '123')
This is clear and simple enough. The key point here is \S+, which means "one or more non-whitespace characters".
Now I'll try to do this with Pyparsing:
>>> from pyparsing import Word, Suppress, nums, printables
>>> expr = (
... Word(printables, excludeChars=':')
... + Suppress(':')
... + Word(printables, excludeChars='(')
... + Suppress('(')
... + Word(nums)
... + Suppress(')')
... )
>>> expr.parseString(string).asList()
['aaa', 'bbb', '123']
Okay, we get the same result, but this does not look good. We set excludeChars to make the Pyparsing expressions stop where we need them to, but this doesn't look robust. If the source string contains the "excluded" characters, the same regex still works fine:
>>> string = 'a:aa:b(bb(123)'
>>> re.match(regex, string).groups()
('a:aa', 'b(bb', '123')
while the Pyparsing expression obviously breaks:
>>> expr.parseString(string).asList()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/long/path/to/pyparsing.py", line 1111, in parseString
    raise exc
ParseException: Expected W:(0123...) (at char 7), (line:1, col:8)
So the question is: how can we implement the needed logic with Pyparsing?
Upvotes: 5
Views: 569
Reputation: 63762
Unlike regex, pyparsing is purely left-to-right seeking, with no implicit lookahead.
If you want regex's lookahead and backtracking, you could just use a Regex containing your original re:
from pyparsing import Regex
string = 'aaa:b(bb(123)'  # input inferred from the output shown here
expr = Regex(r"(\S+):(\S+)\((\d+)\)")
print(expr.parseString(string).dump())
['aaa:b(bb(123)']
However, I see that this returns just the whole match as a single string. If you want to be able to access the individual groups, you'll have to define them as named groups:
expr = Regex(r"(?P<field1>\S+):(?P<field2>\S+)\((?P<field3>\d+)\)")
print(expr.parseString(string).dump())
['aaa:b(bb(123)']
- field1: aaa
- field2: b(bb
- field3: 123
This suggests to me that a good enhancement would be to add a constructor arg to Regex to return the results as a list of all the re groups rather than the string.
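Until something like that exists, one possible workaround is a parse action that swaps the whole-match token for the named groups. This is only a sketch: the field names are the ones from the example above, and the use of a parse action here is my assumption about how to get a flat list, not a built-in Regex feature.

```python
from pyparsing import Regex

# Named groups make each captured field available on the parse results.
expr = Regex(r"(?P<field1>\S+):(?P<field2>\S+)\((?P<field3>\d+)\)")

# Replace the single whole-match token with the three captured fields.
expr.setParseAction(lambda t: [t["field1"], t["field2"], t["field3"]])

tokens = expr.parseString("aaa:b(bb(123)").asList()
print(tokens)  # ['aaa', 'b(bb', '123']
```

Because the regex itself does the backtracking, the question's harder input 'a:aa:b(bb(123)' should come back as ['a:aa', 'b(bb', '123'], the same as with plain re.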
Upvotes: 2
Reputation: 2678
Use a regex with a look-ahead assertion:
from pyparsing import Word, Suppress, Regex, nums, printables
expr = (
    Word(printables, excludeChars=':')
    + Suppress(':')
    + Regex(r'\S+[^\(](?=\()')
    + Suppress('(')
    + Word(nums)
    + Suppress(')')
)
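A quick way to see what this expression does is to run it over a few inputs. The sketch below repeats the expression so it is self-contained; the three sample strings and the `results` dict are just illustrative choices.

```python
from pyparsing import Word, Suppress, Regex, nums, printables

expr = (
    Word(printables, excludeChars=':')   # first field: stops at the first ':'
    + Suppress(':')
    + Regex(r'\S+[^\(](?=\()')           # second field: backtracks to the last '('
    + Suppress('(')
    + Word(nums)
    + Suppress(')')
)

samples = ('aaa:bbb(123)', 'aaa:b(bb(123)', 'a:aa:b(bb(123)')
results = {s: expr.parseString(s).asList() for s in samples}

for s in samples:
    print(s, '->', results[s])
# aaa:bbb(123) -> ['aaa', 'bbb', '123']
# aaa:b(bb(123) -> ['aaa', 'b(bb', '123']
# a:aa:b(bb(123) -> ['a', 'aa:b(bb', '123']
```

Note the last case: an extra '(' in the second field is handled, but an extra ':' in the first field is not. The first Word still stops at the first ':', so the split differs from the ('a:aa', 'b(bb', '123') that the plain regex gives.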
Upvotes: 1