Piotr Migdal

Reputation: 12872

Python regex - why does end of string ($ and \Z) not work with group expressions?

In Python 2.6, it seems that the end-of-string markers $ and \Z are not compatible with group expressions. For example,

import re
re.findall("\w+[\s$]", "green pears")

returns

['green ']

(so $ effectively does not work). And using

re.findall("\w+[\s\Z]", "green pears")

results in an error:

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in findall(pattern, string, flags)
    175 
    176     Empty matches are included in the result."""
--> 177     return _compile(pattern, flags).findall(string)
    178 
    179 if sys.hexversion >= 0x02020000:

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in _compile(*key)
    243         p = sre_compile.compile(pattern, flags)
    244     except error, v:
--> 245         raise error, v # invalid expression
    246     if len(_cache) >= _MAXCACHE:
    247         _cache.clear()

error: internal: unsupported set operator

Why does it work this way, and how can I work around it?

Upvotes: 20

Views: 36848

Answers (3)

Junji Zhi

Reputation: 1480

Martijn Pieters' answer is correct. To elaborate a bit, if you use capturing groups

r"\w+(\s|$)"

you get:

>>> re.findall(r"\w+(\s|$)", "green pears")
[' ', '']

That's because re.findall() returns the captured group (\s|$) values.

Parentheses () serve two purposes: grouping and capturing. To group without capturing, use the (?:...) syntax:

>>> re.findall(r"\w+(?:\s|$)", "green pears")
['green ', 'pears']

Upvotes: 2

Martijn Pieters

Reputation: 1125058

A [...] expression is a character class (also called a character set), meaning it matches any one character contained therein. Inside it, $ is therefore just a literal $ character. A character class always consumes exactly one input character, so it can never contain an anchor.
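To make that concrete, here is a minimal sketch. The sample string is made up for illustration and deliberately contains a $ so the literal match is visible. The second half tries the \Z variant from the question; recent Python 3 releases should reject it at compile time with re.error, but that part is an assumption about your version, and the message will not be the 2.6 one quoted above.

```python
import re

# Hypothetical sample text; the "$" is included on purpose.
text = "cost$ green pears"

# Inside [...], "$" is an ordinary character, so the set matches
# one whitespace character or one literal dollar sign.
literal_dollar = re.findall(r"\w+[\s$]", text)
print(literal_dollar)  # ['cost$', 'green ']

# "\Z" is an anchor, not a character, so it has no meaning in a set.
# Assumption: the running Python rejects it with re.error; we record
# the outcome rather than relying on it.
try:
    re.compile(r"\w+[\s\Z]")
    z_in_set_rejected = False
except re.error:
    z_in_set_rejected = True
print(z_in_set_rejected)
```

Note that "pears" is missing from the first result: the set still needs one real character after the word, and at the end of the string there is none.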

If you want to match either a whitespace character or the end of the string, use a non-capturing group instead, combined with the | alternation operator:

r"\w+(?:\s|$)"

Alternatively, look at the \b word boundary anchor. It matches anywhere a run of \w characters starts or ends (that is, at points in the text where a \w character is preceded or followed by a \W character, or sits at the start or end of the string).
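A short side-by-side sketch of the two options, using the string from the question. The variable names are only for illustration. The practical difference is whether the trailing whitespace ends up in the match.

```python
import re

text = "green pears"

# Non-capturing group: the whitespace (if any) is part of the match.
with_group = re.findall(r"\w+(?:\s|$)", text)
print(with_group)  # ['green ', 'pears']

# Word boundary: \b is zero-width, so no whitespace is consumed.
with_boundary = re.findall(r"\w+\b", text)
print(with_boundary)  # ['green', 'pears']
```

If you only want the words themselves, the \b form saves you from stripping the results afterwards.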

Upvotes: 38

BrenBarn

Reputation: 251598

Square brackets don't indicate a group; they indicate a character set, which matches one character (any one of those in the brackets). As documented, "special characters lose their special meaning inside sets" (except where indicated otherwise, as with classes like \s).

If you want to match \s or end of string, use something like \s|$.
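One caveat when dropping \s|$ into a longer pattern: | has the lowest precedence, so the alternation usually needs to be wrapped in a group. A rough sketch with the string from the question (the variable names are just for illustration):

```python
import re

text = "green pears"

# Ungrouped: this reads as (\w+\s) OR ($), so "pears" is never
# matched as a word; only the empty end-of-string match is found.
ungrouped = re.findall(r"\w+\s|$", text)
print(ungrouped)  # ['green ', '']

# Grouped: the alternation applies only to what follows the word.
grouped = re.findall(r"\w+(?:\s|$)", text)
print(grouped)  # ['green ', 'pears']
```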

Upvotes: 4
