o3bvv

Reputation: 5686

Greedy expressions in Pyparsing

I'm trying to split a string like aaa:bbb(123) into tokens using Pyparsing.

I can do this with a regular expression, but I need to do it with Pyparsing.

With re, the solution looks like this:

>>> import re
>>> string = 'aaa:bbb(123)'
>>> regex = r'(\S+):(\S+)\((\d+)\)'
>>> re.match(regex, string).groups()
('aaa', 'bbb', '123')

This is clear and simple enough. The key point here is \S+, which means "one or more non-whitespace characters".

Now I'll try to do this with Pyparsing:

>>> from pyparsing import Word, Suppress, nums, printables
>>> expr = (
...     Word(printables, excludeChars=':')
...     + Suppress(':')
...     + Word(printables, excludeChars='(')
...     + Suppress('(')
...     + Word(nums)
...     + Suppress(')')
... )
>>> expr.parseString(string).asList()
['aaa', 'bbb', '123']

Okay, we've got the same result, but this does not look good. We set excludeChars to make the Pyparsing expressions stop where we need them to, but this doesn't look robust. If the source string contains any of the "excluded" characters, the same regex still works fine:

>>> string = 'a:aa:b(bb(123)'
>>> re.match(regex, string).groups()
('a:aa', 'b(bb', '123')

while the Pyparsing expression obviously breaks:

>>> expr.parseString(string).asList()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/long/path/to/pyparsing.py", line 1111, in parseString
    raise exc
ParseException: Expected W:(0123...) (at char 7), (line:1, col:8)

So the question is: how can we implement the same logic with Pyparsing?

Upvotes: 5

Views: 569

Answers (2)

PaulMcG

Reputation: 63762

Unlike regex, pyparsing is purely left-to-right seeking, with no implicit lookahead.
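To make the difference concrete, here is a minimal sketch (the names greedy, pyparsing_matched and re_groups are only for illustration). It assumes an unrestricted Word(printables) for the first field: pyparsing lets it consume the ':' too and never gives it back, while re backtracks until the rest of the pattern fits.

```python
import re

from pyparsing import ParseException, Word, printables

text = 'aaa:bbb'

# Word(printables) consumes the whole string, ':' included, and pyparsing
# does not back up to let the following ':' literal match.
greedy = Word(printables) + ':' + Word(printables)

try:
    greedy.parseString(text)
    pyparsing_matched = True
except ParseException:
    pyparsing_matched = False

# The equivalent regex succeeds because \S+ gives characters back
# until ':' can match.
re_groups = re.match(r'(\S+):(\S+)', text).groups()

print(pyparsing_matched)
print(re_groups)
```

This is why the excludeChars arguments in the question are needed at all: each Word has to be told up front where to stop.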

If you want regex's lookahead and backtracking, you could just use a Regex containing your original re:

from pyparsing import Regex

string = 'aaa:b(bb(123)'
expr = Regex(r"(\S+):(\S+)\((\d+)\)")
print(expr.parseString(string).dump())

['aaa:b(bb(123)']

However, I see that this returns just the whole match as a single string. If you want to be able to access the individual groups, you'll have to define them as named groups:

expr = Regex(r"(?P<field1>\S+):(?P<field2>\S+)\((?P<field3>\d+)\)")
print(expr.parseString(string).dump())

['aaa:b(bb(123)']
- field1: aaa
- field2: b(bb
- field3: 123    

This suggests to me that a good enhancement would be to add a constructor arg to Regex to return the results as a list of all the re groups rather than the string.
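One way to get a flat list of the groups without relying on such an argument is to attach a parse action to the named-group version. This is only a sketch: the helper groups_as_list is a hypothetical name, and it assumes the three group names used above.

```python
from pyparsing import Regex


def groups_as_list(tokens):
    # Replace the single matched string with the named groups, in order.
    return [tokens['field1'], tokens['field2'], tokens['field3']]


expr = Regex(r"(?P<field1>\S+):(?P<field2>\S+)\((?P<field3>\d+)\)")
expr.setParseAction(groups_as_list)

result = expr.parseString('aaa:b(bb(123)').asList()
print(result)
```

The tokens returned by the parse action replace the original match, so the results names are not carried over to the new list. Keep the plain named-group version if you need access by name.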

Upvotes: 2

blaze

Reputation: 2678

Use a regex with a look-ahead assertion:

from pyparsing import Word, Suppress, Regex, nums, printables

expr = (
    Word(printables, excludeChars=':')
    + Suppress(':')
    + Regex(r'\S+[^\(](?=\()')
    + Suppress('(')
    + Word(nums)
    + Suppress(')')
)
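A quick check of how this behaves on the strings from the question (a sketch; the variable names first and second are only for illustration). Note that the first field still uses excludeChars=':', so it stops at the first colon instead of the last one as the regex does. As far as I can tell, the Regex part also needs at least two characters before the final '('.

```python
from pyparsing import Word, Suppress, Regex, nums, printables

expr = (
    Word(printables, excludeChars=':')
    + Suppress(':')
    # Greedy \S+ backtracks to the last non-'(' character followed by '('.
    + Regex(r'\S+[^\(](?=\()')
    + Suppress('(')
    + Word(nums)
    + Suppress(')')
)

# A '(' inside the second field is now handled.
first = expr.parseString('aaa:b(bb(123)').asList()

# A ':' inside the first field still splits at the first colon.
second = expr.parseString('a:aa:b(bb(123)').asList()

print(first)
print(second)
```

So this fixes the second field but does not reproduce the regex result ('a:aa', 'b(bb', '123') for the second string.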

Upvotes: 1
