Maksim Surov

Reputation: 613

Python re.groups doesn't show all the subgroups

I have a string like '234 3452789 23 901234 ...'. I want to extract all the numbers. I wrote the following regular expression:

import re

s = '234 3452789 23 901234'
expr = r'^\s*(\d+\s*)+$'
e = re.match(expr, s)
print(e.groups())

I expect to see a tuple containing all the numbers, but this code actually prints only the last number:

('901234',)

Question: What's wrong with my code, and how do I fix it?

P.S. The code below works, but only for exactly four numbers. I want to parse strings with any number of substrings:

expr = r'^\s*(\d+\s*)(\d+\s*)(\d+\s*)(\d+\s*)$'
e = re.match(expr, s)
print(e.groups())

Upvotes: 2

Views: 1232

Answers (5)

fahad daniyal

Reputation: 269

My two cents on your actual goal: why use a regex at all? Why not use:

[int(grp) for grp in s.split() if grp.isdigit()]

This splits the string on whitespace, iterates over the resulting tokens, and adds a token to the list only if it consists entirely of digits. The isdigit() check makes sure that only numbers end up in the list.
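A minimal sketch of how the split-based approach behaves on less tidy input. The string `mixed` and the two helper names are made up for illustration. Tokens that are not purely digits are dropped entirely, whereas re.findall still pulls the digits out of them, so the two approaches are only interchangeable on clean input.

```python
import re

s = '234 3452789 23 901234'
mixed = '234 abc12 23 7x'  # hypothetical input with non-numeric tokens


def numbers_by_split(text):
    # split() with no argument splits on any run of whitespace
    return [int(tok) for tok in text.split() if tok.isdigit()]


def numbers_by_findall(text):
    # findall returns every non-overlapping run of digits
    return [int(m) for m in re.findall(r'\d+', text)]


print(numbers_by_split(s))        # [234, 3452789, 23, 901234]
print(numbers_by_split(mixed))    # [234, 23]
print(numbers_by_findall(mixed))  # [234, 12, 23, 7]
```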

It's (a) faster:

python -m timeit -s "import re" "[int(grp) for grp in re.findall('\d+','234 3452789 23 901234')]"
>> 100000 loops, best of 3: 4.14 usec per loop

python -m timeit "[int(grp) for grp in '234 3452789 23 901234'.split() if grp.isdigit()]"
>> 100000 loops, best of 3: 2.99 usec per loop

and (b), based on what I have read in multiple discussions here, more predictable and easier to understand. I once tried explaining the subtleties between re.findall, re.search, re.split, and re.finditer, and it took me some time. My recommendation: avoid re if you can.
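For reference, here is a rough side-by-side of the four functions mentioned, applied to the string from the question. It is only meant to show what each one returns, not to recommend any of them.

```python
import re

s = '234 3452789 23 901234'

# search: the first match only, as a match object
first = re.search(r'\d+', s).group()

# findall: every match, as a list of strings
all_numbers = re.findall(r'\d+', s)

# finditer: every match, as match objects (positions are available too)
spans = [m.span() for m in re.finditer(r'\d+', s)]

# split: the pieces between matches of the separator pattern
pieces = re.split(r'\s+', s)

print(first)        # 234
print(all_numbers)  # ['234', '3452789', '23', '901234']
print(spans)        # [(0, 3), (4, 11), (12, 14), (15, 21)]
print(pieces)       # ['234', '3452789', '23', '901234']
```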

Upvotes: 1

Andrea Corbellini

Reputation: 17761

TL;DR: use findall():

>>> import re
>>> s = '234 3452789 23 901234'
>>> re.findall(r'\d+', s)
['234', '3452789', '23', '901234']

I expect to see a tuple containing all the numbers, but this code actually prints only the last number:

('901234',)

Question: What's wrong with my code, and how do I fix it?

That's how match() works, and you can't do anything about it. A regular expression containing one group (like yours) will return only one group. Putting a + or a * to the right of the group means that only its last match is kept. It works this way by design.

If you really want to go with match(), the third-party regex module provides the captures() and capturesdict() methods, which do what you want. However, it's not part of the standard library.
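If the reason for using match() was to check that the whole string is well formed, one standard-library option is to validate first and then extract with findall(). This is only a sketch: the function name `extract_numbers` is invented here, and the validation pattern is a slightly restructured variant of the one in the question rather than the original.

```python
import re

# Whole string must be whitespace-separated runs of digits.
# (?:...) is a non-capturing group, since we only validate here.
VALID = re.compile(r'^\s*\d+(?:\s+\d+)*\s*$')


def extract_numbers(text):
    # Return None when the string contains anything but numbers and spaces
    if VALID.match(text) is None:
        return None
    return re.findall(r'\d+', text)


print(extract_numbers('234 3452789 23 901234'))  # ['234', '3452789', '23', '901234']
print(extract_numbers('234 abc'))                # None
```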

Upvotes: 2

Avinash Raj

Reputation: 174706

What's wrong with your first code?

The regex r'^\s*(\d+\s*)+$' matches all the digit and space characters from the start of the string, but it captures only the last run of digits and the zero or more spaces that follow it, because you're repeating the capturing group one or more times.

For example, (1+) and (1)+ match the same text, but they capture different sets of 1s. The first regex captures all the matched 1s, whereas the second captures only the last 1 in each match.
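The difference is easy to check in the interpreter. In this short illustration the input '111' is chosen arbitrarily: both patterns match the full string, but the group contents differ.

```python
import re

text = '111'

whole = re.match(r'(1+)', text)  # the quantifier is inside the group
last = re.match(r'(1)+', text)   # the group itself is repeated

# group(0) is the overall match; groups() holds what each group captured
print(whole.group(0), whole.groups())  # 111 ('111',)
print(last.group(0), last.groups())    # 111 ('1',)
```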

matchobj.groups() returns a tuple with one entry per capturing group in the pattern, each holding the characters that the group captured.

Upvotes: 0

Vlad

Reputation: 18633

It matches the entire string because of ^...$, and captures only the last match for the (...). Someone did file an issue about allowing multiple matches to accumulate in a list, but I assume it wasn't a strong enough use case.

The indexing of groups() is based on the layout of capturing groups in your regex, and not the string it's used on, so you wouldn't get a group for each distinct occurrence anyway.
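A quick way to see this: the number of groups is a property of the compiled pattern and stays the same whatever string is matched. The sketch below uses the pattern from the question; the single-number input '7' is added only for comparison.

```python
import re

pattern = re.compile(r'^\s*(\d+\s*)+$')

# Number of capturing groups, fixed by the pattern alone
print(pattern.groups)  # 1

results = []
for s in ('7', '234 3452789 23 901234'):
    m = pattern.match(s)
    # groups() always has pattern.groups entries, regardless of the input
    results.append(m.groups())
    print(len(m.groups()), m.groups())
# 1 ('7',)
# 1 ('901234',)
```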

Upvotes: 0

viral_mutant

Reputation: 1003

The $ towards the end would make it choose the last section only

Upvotes: -1
