Kennedy

Reputation: 317

Regular expression can't match multiple groups of the same type

I'm playing with regexes in Python. I know there is a ton of documentation on this, but I just can't understand this apparently simple example.

Given this code:

import re
phoneNumRegex = re.compile(r'(\d\d\d)*')
mo = phoneNumRegex.search('My number is 415-555-4242. 423-531-5412')
print(mo.group())

I'm expecting to get the output:

415, 555, 423, 531

However, the program only prints an empty string (nothing). My logic was to specify a group of 3 digits and then use * to match that group 0 or more times. Since I have multiple 3-digit groups in my string, I was expecting to get all of them printed. What am I doing wrong?

I also tried + instead of *, which by my understanding should find the group at least once. If I do that, it only prints the first group and not all of them, as I would expect. How should I write this to get all 3-digit groups printed?

Upvotes: 3

Views: 1498

Answers (2)

Wiktor Stribiżew

Reputation: 626794

You have defined a repeated capturing group. The (\d\d\d)* pattern matches any 3 digits, zero or more times (due to the * quantifier), and captures them into the capturing group with ID 1. That is, if there is no digit at a certain location inside the string, the pattern matches an empty string there, and if there are 6 consecutive digits, it will match them all, but the capturing group will only retain the last 3. See your pattern demo with multiple matching enabled.
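To see the "last 3 digits" behaviour in isolation, here is a minimal sketch. The input string '415555' is just an illustrative value, not taken from the question:

```python
import re

# Six consecutive digits: the group repeats twice, but only the
# last repetition is kept in group 1.
m = re.fullmatch(r'(\d\d\d)*', '415555')

print(m.group(0))  # 415555  (the whole match)
print(m.group(1))  # 555     (only the last repetition survives)
```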

However, in your code, you are using re.search, a method that only returns a single (the first) match. Since the engine scans the string from left to right, it tries the starting position first and finds M. It is not a digit, so the pattern matches an empty string before M (the * quantifier allows zero repetitions).
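You can check this directly by inspecting the match object. A small sketch using the string from the question (the variable name text is mine):

```python
import re

text = 'My number is 415-555-4242. 423-531-5412'
mo = re.search(r'(\d\d\d)*', text)

print(repr(mo.group()))  # ''      (an empty match)
print(mo.span())         # (0, 0)  (found right before the 'M')
print(mo.group(1))       # None    (the group never took part in the match)
```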

So, if you use re.findall with this pattern, the resulting list will contain many empty strings.

As a quick fix, you could use the + quantifier (1 or more repetitions), but each match would still only return the last 3-digit chunk of each run of digits.
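A rough comparison of the two quantifiers with re.findall, again on the question's string. I am not listing the full output for * since the exact number of empty strings is not the point; the second input, '415555', is an illustrative value I added:

```python
import re

text = 'My number is 415-555-4242. 423-531-5412'

# With *, every position without 3 digits yields an empty string.
star = re.findall(r'(\d\d\d)*', text)
print('' in star)   # True

# With +, the empty strings go away, but 4242 and 5412 still
# contribute a 3-digit piece.
plus = re.findall(r'(\d\d\d)+', text)
print(plus)         # ['415', '555', '424', '423', '531', '541']

# On a longer run of digits, only the last repetition is returned.
last_only = re.findall(r'(\d\d\d)+', '415555')
print(last_only)    # ['555']
```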

The solution is to use a multiple matching method, like re.findall or re.finditer, with a pattern that has no enclosing quantified group: r'\d{3}'. If you need to match a 3-digit number that is not adjacent to other digits, use r'(?<!\d)\d{3}(?!\d)', or use r'\b\d{3}\b' to match the 3-digit chunks as whole words. See a sample regex demo.
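Here is how the three patterns behave on the string from the question, as a quick sketch (variable names are mine):

```python
import re

text = 'My number is 415-555-4242. 423-531-5412'

# Any 3 consecutive digits, including pieces of longer numbers.
any_three = re.findall(r'\d{3}', text)
print(any_three)   # ['415', '555', '424', '423', '531', '541']

# Exactly 3 digits, with no digit directly before or after.
isolated = re.findall(r'(?<!\d)\d{3}(?!\d)', text)
print(isolated)    # ['415', '555', '423', '531']

# 3 digits as a whole word.
whole_word = re.findall(r'\b\d{3}\b', text)
print(whole_word)  # ['415', '555', '423', '531']

# re.finditer yields match objects, in case positions are needed too.
spans = [(m.group(), m.start()) for m in re.finditer(r'\b\d{3}\b', text)]
```

Note that the last two patterns are not fully interchangeable: \b treats letters and underscores as word characters, so in a string like 'id123' the lookaround version finds '123' while the \b version finds nothing.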

Upvotes: 2

Rakesh

Reputation: 82765

Use re.findall, which returns every non-overlapping match instead of only the first one.

Example:

import re
phoneNumRegex = re.compile(r'(\b\d{3}\b)')
mo = phoneNumRegex.findall('My number is 415-555-4242. 423-531-5412')
print(mo)

Output:

['415', '555', '423', '531']

Upvotes: 2
