Brad Solomon

Reputation: 40878

re.VERBOSE and lookahead assertion error

I have a regex compiled with the re.X (verbose) flag that throws an exception, even though it appears to be equivalent to its condensed version. (I built the former from the latter.)

Condensed version:

import re
test = 'catdog'
test2 = 'dogcat'
pat = re.compile(r'(?=\b\w{6}\b)\b\w*cat\w*\b')

print(pat.search(test))
print(pat.search(test2))
# catdog Match object
# dogcat Match object

Verbose version:

pat = re.compile(r"""(               # Start of group (lookahead); need raw string
                     ?=              # Positive lookahead; notation = `q(?=u)`
                     \b\w{6}\b       # Word boundary and 6 alphanumeric characters
                     )               # End of group (lookahead)
                     \b\w*cat\w*\b   # Literal 'cat' in between 0 or more alphanumeric""", re.X)
print(pat.search(test).string)
print(pat.search(test2).string)

# Throws exception
# error: nothing to repeat at position 83 (line 2, column 22)

What's causing this? I can't see how the expanded version violates any rule of re.X/re.VERBOSE. From the docs:

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

There are no character classes or whitespace preceded by unescaped backslashes, as far as I can tell.

Upvotes: 2

Views: 227

Answers (2)

user2357112

Reputation: 280788

This is Python issue 15606. re's behavior with whitespace inside a token in verbose mode doesn't match the documentation. You can't put whitespace in the middle of (?=.
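
A minimal sketch of that behavior, assuming a current CPython 3 interpreter. The helper name `compiles` is hypothetical and exists only for this illustration. It compares a pattern where the lookahead opener stays in one piece against one where a space splits it:

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles under re.VERBOSE, else False.

    Hypothetical helper for illustration only; it is not part of the re module.
    """
    try:
        re.compile(pattern, re.VERBOSE)
        return True
    except re.error:
        return False

# Whitespace between tokens is ignored, so this compiles.
print(compiles(r"(?= \w{6} ) \w*"))

# Whitespace inside the "(?=" token splits it, so the "?" has nothing to repeat.
print(compiles(r"( ?= \w{6} ) \w*"))
```

The first call should report that the pattern compiles and the second that it does not, which is the same failure as in the question.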

Upvotes: 3

JBernardo

Reputation: 33397

The issue is with the ?= on the second line. On its own, ? is the "zero or one" quantifier (as in [ ]?, which means zero or one spaces). Whitespace is ignored in verbose mode, but it still splits ( and ? into separate tokens, so the ? is parsed as a quantifier instead of as part of the (?= lookahead opener.

Move the ?= up to the first line, directly after the opening parenthesis, so that the token reads (?= and the pattern compiles.
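
The fix described above can be sketched as follows. This is one possible layout of the question's pattern, not the only one; the test words come from the question.

```python
import re

# Same pattern as the condensed version, with the "(?=" token kept intact.
pat = re.compile(r"""
    (?=              # Start of positive lookahead, written as one token
        \b\w{6}\b    # Word boundary, exactly 6 word characters, word boundary
    )                # End of lookahead
    \b\w*cat\w*\b    # A whole word that contains the literal 'cat'
    """, re.X)

for word in ("catdog", "dogcat"):
    m = pat.search(word)
    # Print the matched text, or None when there is no match.
    print(word, m.group(0) if m else None)
```

Both six-letter test words match in full, as they do with the condensed pattern.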

The error message

error: nothing to repeat at position 83

makes it pretty clear that the ? is being interpreted as a repetition operator here.

Upvotes: 2
