Reputation: 8173
I would like to get all possible subgroups during regex findall with a pattern of the form (group(subgroup))+. Currently it only returns the last matches, for example:
>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]
Now I have to do that in two steps:
>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']
Is there a method that gives me the same result in a single step?
Upvotes: 3
Views: 238
Reputation: 626903
Since (([A-D])[0-9]+)+ is a repeated capturing group, it is no wonder that only the last match results are returned.
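To make the behaviour concrete, here is a minimal sketch with the standard re module and the pattern from the question. The variable names are only illustrative. The overall match spans every repetition, but each group remembers only its final iteration:

```python
import re

pattern = re.compile(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+')
m = pattern.match('SOME_STRING_A2B2C3_OTK')

# The whole match covers all three repetitions of the outer group...
whole = m.group(0)
# ...but each capturing group keeps only its last iteration.
last_outer = m.group(1)
last_inner = m.group(2)

print(whole)       # SOME_STRING_A2B2C3_OTK
print(last_outer)  # C3
print(last_inner)  # C
```

This is why findall, which reports group contents rather than the whole match, shows only ('C3', 'C').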
You may use the PyPI regex library (install it by running pip install regex in the console/terminal) and then use:
import regex
results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
print([list(zip(x.captures(1), x.captures(2))) for x in results])  # list() is needed in Python 3, where zip is lazy
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
The match.captures() method returns every capture of a group, not only the last one.
If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:
import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
    results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
See the Python demo
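If the two-step approach is needed in several places, it could be wrapped in a small helper. This is only a sketch: the names OUTER, INNER and extract_pairs are hypothetical, and the patterns are the same ones used above, precompiled.

```python
import re

# Hypothetical names; the patterns are the ones from the re-only solution.
OUTER = re.compile(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+')
INNER = re.compile(r'(([A-D])[0-9]+)')

def extract_pairs(text):
    """Return one list of (token, letter) pairs per outer match in text."""
    # Step 1: OUTER.findall yields the full letter-digit run of each match.
    # Step 2: INNER.findall splits each run into its (token, letter) pairs.
    return [INNER.findall(run) for run in OUTER.findall(text)]

print(extract_pairs('SOME_STRING_A2B2C3_OTK'))
# [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
```

Callers then see a single call, although two regex passes still happen internally.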
Upvotes: 2
Reputation: 4648
A single-regex (and possibly single-pass) solution is possible, provided your sample code and sample data are both representative. The assumed premises are:
SOME_STRING_ is fixed. This is based on the example data you give, where SOME_STRING_ reads as a literal string and not a regex.
The data contain no [E-Z] or other exceptions in the "alphabet-digits" part. This is based on your working two-line solution, which would have raised AttributeError: 'NoneType' object has no attribute 'group' if data like SOME_STRING_A1B2Z3_OTK existed. No such error was reported, so I assume you do not have such data.
If the above are met, a single regex r"[0-9]+" can be used to perform a straightforward string split. Each run of digits is discarded as a whole because the + operator is greedy, according to the official documentation. A greedy match can in theory be done in a single pass over the data, so the efficiency should be satisfactory if that is indeed the case. (I did not check the implementation details, though.)
Solution
import re
s = 'SOME_STRING_A10B20C30_OTK' # len("SOME_STRING_") = 12 is fixed
# may have multiple digits in between
re.compile(r"[0-9]+").split(s[12:])[:-1] # discard the last element
# returns ['A', 'B', 'C']
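The hard-coded 12 can be avoided by deriving the offset from the prefix itself. A minimal sketch under the same two premises; PREFIX and split_letters are hypothetical names introduced only for illustration:

```python
import re

PREFIX = 'SOME_STRING_'  # assumed to be a fixed literal, per the first premise

def split_letters(s, prefix=PREFIX):
    """Split the part after the prefix on digit runs and drop the trailing piece."""
    if not s.startswith(prefix):
        raise ValueError('input does not start with the expected prefix')
    parts = re.split(r'[0-9]+', s[len(prefix):])
    return parts[:-1]  # the last element is the '_OTK'-style suffix

print(split_letters('SOME_STRING_A10B20C30_OTK'))
# ['A', 'B', 'C']
```

Note that this does not check that the letters fall in [A-D]; for SOME_STRING_A1B2Z3_OTK it returns ['A', 'B', 'Z'], so it relies on the second premise holding.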
Upvotes: 0