Reputation: 613
I have a string like '234 3452789 23 901234 ...'. I want to extract all the numbers. I wrote the following regular expression:
import re

s = '234 3452789 23 901234'
expr = r'^\s*(\d+\s*)+$'
e = re.match(expr, s)
print(e.groups())
I expect to see a tuple containing all the numbers, but this code actually prints only the last number:
('901234',)
Question: What's wrong with my code, and how can I fix it?
P.S. The code below works well, but I want to parse strings with any number of substrings:
expr = r'^\s*(\d+\s*)(\d+\s*)(\d+\s*)(\d+\s*)$'
e = re.match(expr, s)
print(e.groups())
Upvotes: 2
Views: 1232
Reputation: 269
My two cents, to answer your actual question: why use a regex at all, and why not use
[int(grp) for grp in s.split() if grp.isdigit()]
This splits the string on whitespace, iterates over the resulting pieces, and keeps each piece that consists only of digits, converting it to an int. The isdigit() check makes sure that only numbers end up in the list.
It's (a) faster:
python -m timeit -s "import re" "[int(grp) for grp in re.findall('\d+','234 3452789 23 901234')]"
>> 100000 loops, best of 3: 4.14 usec per loop
python -m timeit "[int(grp) for grp in '234 3452789 23 901234'.split() if grp.isdigit()]"
>> 100000 loops, best of 3: 2.99 usec per loop
and (b) based on what I have read in multiple discussions here, predictable and easy to understand. I once tried explaining the subtleties between re.findall, re.search, re.split, and re.finditer, and it took me some time. My recommendation: try to avoid re if you can.
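For reference, here is a rough sketch of how those four functions behave on the string from the question. The variable names are mine, picked only for illustration.

```python
import re

s = '234 3452789 23 901234'

# findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r'\d+', s)

# search returns a match object for the first match only (or None).
first_number = re.search(r'\d+', s).group()

# split cuts the string wherever the pattern matches (here, on whitespace).
pieces = re.split(r'\s+', s)

# finditer yields match objects lazily, so start positions are available too.
spans = [(m.group(), m.start()) for m in re.finditer(r'\d+', s)]

print(all_numbers)
print(first_number)
print(pieces)
print(spans)
```

For plain whitespace-separated input like this, findall and split give the same list; they only diverge once the string contains something other than digits and spaces.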
Upvotes: 1
Reputation: 17761
TL;DR: use findall():
>>> s = '234 3452789 23 901234'
>>> re.findall(r'\d+', s)
['234', '3452789', '23', '901234']
I expect to see a tuple containing all the numbers, but actually this code prints the latest number only:
('901234',)
Question: What's wrong in my code, and how to fix it?
That's how match() works; you can't do anything about it. A regular expression containing one group (like yours) returns only one group. Putting a + or a * to the right of the group repeats it, and each repetition overwrites the previous capture, so you get only the last match. It works this way by design.
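If the anchored pattern was there to validate the input, one possible workaround (a sketch, not the only way) is to keep match() for the validation and let findall() do the extraction. The helper name extract_numbers is made up for this example, and I rewrote the validation pattern slightly so that it has no nested quantifiers.

```python
import re

def extract_numbers(text):
    """Return the list of digit runs if text holds only digits and
    whitespace, otherwise None. (Hypothetical helper for illustration.)"""
    # Validation only: a non-capturing group, so nothing is stored here.
    if re.match(r'^\s*\d+(?:\s+\d+)*\s*$', text) is None:
        return None
    # Extraction: findall collects every repetition, not just the last.
    return re.findall(r'\d+', text)

print(extract_numbers('234 3452789 23 901234'))
print(extract_numbers('12 abc 34'))
```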
If you really want to go with match(), the third-party regex module provides the captures and capturesdict methods, which do what you want. However, it's not part of the standard library.
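A minimal sketch of the captures approach, assuming the regex package is installed (pip install regex). The import is guarded so the snippet still runs without it, and I moved the \s* outside the capturing group so the captured strings carry no trailing spaces; that tweak is mine, not part of the question's pattern.

```python
s = '234 3452789 23 901234'

try:
    import regex  # third-party package, not in the standard library
except ImportError:
    regex = None

if regex is not None:
    m = regex.match(r'^\s*(?:(\d+)\s*)+$', s)
    numbers = m.captures(1)   # every repetition of group 1, in order
    last_only = m.group(1)    # same as re: only the last repetition
else:
    numbers = None
    last_only = None

print(numbers)
print(last_only)
```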
Upvotes: 2
Reputation: 174706
What's wrong with your first code?
The regex r'^\s*(\d+\s*)+$' matches all the digit and space characters from the start of the string, but it captures only the last run of digits and any spaces that follow it, because the capturing group is repeated one or more times.
For example, (1+) and (1)+ match the same text, but they capture different sets of 1's. The first regex captures all the matched 1's, whereas the second captures only the last 1 in each match.
matchobj.groups() returns a tuple with one entry per capturing group in the pattern, each entry holding the characters captured by that group.
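A quick way to see the difference, using a made-up input of three 1's:

```python
import re

text = '111'

# Quantifier inside the group: the group spans the whole run.
whole_run = re.match(r'(1+)', text)

# Quantifier outside the group: the group is re-captured on each
# repetition, so only the final '1' is left.
last_only = re.match(r'(1)+', text)

print(whole_run.group(0), whole_run.groups())
print(last_only.group(0), last_only.groups())
```

Both patterns match the same overall text (group(0)); only the content of group 1 differs.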
Upvotes: 0
Reputation: 18633
It matches the entire string due to ^...$, and only captures the last match for the (...). I assume accumulating every repetition wasn't considered a strong enough use case, although someone did file an issue about allowing multiple matches to accumulate in a list.
The indexing of groups() is based on the layout of capturing groups in your regex, and not the string it's used on, so you wouldn't get a group for each distinct occurrence anyway.
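To illustrate, here is a small sketch using the pattern from the question on two inputs of different length (the second input, '7 8', is my own). The number of groups is fixed when the pattern is compiled:

```python
import re

pattern = re.compile(r'^\s*(\d+\s*)+$')

four = pattern.match('234 3452789 23 901234')
two = pattern.match('7 8')

# The compiled pattern knows how many capturing groups it has,
# before it ever sees an input string.
print(pattern.groups)

# Four numbers or two, groups() always has exactly one slot.
print(four.groups())
print(two.groups())
```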
Upvotes: 0
Reputation: 1003
The $ towards the end would make it choose the last section only
Upvotes: -1