Rodrigo Caetano

Reputation: 49

Understanding the (\D\d)+ Regex pattern in Python

I'm spending some time trying to understand the Regex terminology in Python 3 and I can't figure out how (\D\d)+ works here.

I know that \D represents a nondigit character and that \d represents a digit character, and also that the plus sign + represents one or more repetitions of the preceding expression. But when I try the following code, I simply can't wrap my head around the result.

Input:

import re
text = "a1 b2 c3 d4e5f6"
regex = re.findall(r'(\D\d)+',text)
print(regex)

Output:

['a1', 'b2', 'c3', 'f6']

Since the regex includes a plus sign, shouldn't it also output d4e5f6, as that is a sequence of nondigit and digit characters?

Upvotes: 3

Views: 2660

Answers (1)

jasonharper

Reputation: 9597

You aren't directly repeating the \D\d subpattern with the +; you are repeating a capturing group (indicated by the parentheses) that contains that subpattern. The final match does cover the text d4e5f6, but it gets there as three iterations of the capturing group, each of which overwrites the previous capture, so only f6 is kept. And when the pattern contains capturing groups, Python's re.findall() returns THEM (as a tuple, if there's more than one) instead of the overall match.
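To see both pieces side by side, here is a minimal sketch using re.finditer(), which yields match objects instead of just the group contents. The names pattern and pairs are only for illustration; group(0) is the overall match and group(1) is whatever the capturing group held after its last iteration.

```python
import re

text = "a1 b2 c3 d4e5f6"
pattern = re.compile(r'(\D\d)+')

# For each match, collect the whole matched text and the final
# content of capturing group 1 (its last repetition).
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

for whole, last in pairs:
    print(f"whole match: {whole!r}, group 1: {last!r}")
```

The last pair shows that the overall match is d4e5f6 while group 1 holds only f6, which is the value findall() reports.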

There is a newer third-party regex module (installed separately from PyPI, not part of the standard library) that is capable of returning multiple matches for a single capturing group, although I'm not exactly sure how that interacts with findall(). You could also write the expression as (?:\D\d)+ - (?: starts a non-capturing group, so findall() will give you the entire match as you expect.
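A short sketch of the non-capturing version with the standard re module, using the same text as in the question. The second variant, which wraps the whole repetition in one outer capturing group, is an extra illustration rather than something you need; the variable names are arbitrary.

```python
import re

text = "a1 b2 c3 d4e5f6"

# Non-capturing group: findall() has no groups to return,
# so it returns each overall match.
non_capturing = re.findall(r'(?:\D\d)+', text)
print(non_capturing)

# Alternative: one capturing group around the whole repetition,
# so the single group findall() returns is the full run.
outer_group = re.findall(r'((?:\D\d)+)', text)
print(outer_group)
```

Both calls should print ['a1', 'b2', 'c3', 'd4e5f6'].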

Upvotes: 1
