How can I get all possible subgroups in Python regex?

Question

I would like to get every repetition of a nested group, (group(subgroup))+, from re.findall. Currently it returns only the last repetition, for example:

>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]

Now I have to do that in two steps:

>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']

Is there a way to get the same result in a single step?

Wiktor Stribiżew · Accepted Answer

Since (([A-D])[0-9]+)+ is a repeated capturing group, each iteration overwrites the previous capture, so re returns only the values from the last iteration.
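To make the overwrite behaviour concrete, here is a minimal sketch using only the standard re module and the pattern from the question. The variable names are just for illustration.

```python
import re

text = 'SOME_STRING_A2B2C3_OTK'

# The outer group repeats three times (A2, B2, C3), but a match object
# stores only one value per group number: the one from the last repetition.
m = re.match(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', text)

last_pair = m.groups()
print(last_pair)  # ('C3', 'C')
```

The earlier repetitions (A2 and B2) were matched, but re does not keep them anywhere on the match object.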

You can use the PyPI regex library (install it by running pip install regex in a terminal) and then use:

import regex

results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
# zip() is lazy in Python 3, so wrap it in list() to print the pairs
print([list(zip(x.captures(1), x.captures(2))) for x in results])
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]

The match.captures() method returns every capture a group made during the match, not only the last one.

If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:

import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
    results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
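If the two-pass approach is needed in several places, it can be wrapped so the caller makes a single call. This is only a sketch: the names OUTER, INNER and all_subgroups are hypothetical, and it still runs two regex passes internally.

```python
import re

# Pass 1: find each whole match and capture the repeated block as one string.
OUTER = re.compile(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+')
# Pass 2: split that block into (group, subgroup) pairs.
INNER = re.compile(r'(([A-D])[0-9]+)')

def all_subgroups(text):
    """Return one list of (group, subgroup) pairs per overall match."""
    return [INNER.findall(m.group(1)) for m in OUTER.finditer(text)]

print(all_subgroups('SOME_STRING_A2B2C3_OTK'))
# [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]
```

Precompiling both patterns avoids recompiling them on every call, and finditer keeps the results grouped per overall match when the input contains more than one.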

