Wang

Reputation: 8173

How can I get all possible subgroups in Python regex?

I would like re.findall to return every capture of a repeated group such as (group(subgroup))+. Currently it returns only the last match, for example:

>>> re.findall(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
[('C3', 'C')]

Now I have to do that in two steps:

>>> match = re.match(r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
>>> re.findall(r'([A-D])[0-9]+', match.group(1))
['A', 'B', 'C']

Is there any way to get the same result in a single step?

Upvotes: 3

Views: 238

Answers (2)

Wiktor Stribiżew

Reputation: 626903

Since (([A-D])[0-9]+)+ is a repeated capturing group, it is no wonder only the last match results are returned.
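A quick check with the standard re module illustrates this. It is only a sketch using the pattern and string from the question: each iteration of the repeated group overwrites the previous capture, so only the last iteration survives.

```python
import re

m = re.match(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')

# The overall match spans all three repetitions...
print(m.group(0))   # SOME_STRING_A2B2C3_OTK
# ...but each capturing group only remembers its last iteration.
print(m.groups())   # ('C3', 'C')
```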

You may use the PyPI regex library (install it by running pip install regex in the console/terminal) and then use:

import regex

results = regex.finditer(r'SOME_STRING_(([A-D])[0-9]+)+_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
print( [list(zip(x.captures(1), x.captures(2))) for x in results] )  # list() is needed in Python 3, where zip is lazy
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]

The match.captures() method returns every capture of a group, not just the last one.

If you can only use re, you need to first extract all your matches, and then run a second regex on them to extract the parts you need:

import re
tmp = re.findall(r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+', 'SOME_STRING_A2B2C3_OTK')
results = []
for m in tmp:
    results.append(re.findall(r'(([A-D])[0-9]+)', m))
print( results )
# => [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')]]

See the Python demo
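If the input may contain several matches, the two-step re approach can be wrapped in a small helper. This is a minimal sketch: the function name captures_per_match, its parameters, and the second sample string are hypothetical and not part of the original answer.

```python
import re

# Hypothetical helper: the name and signature are illustrative only.
def captures_per_match(outer, inner, text):
    """For each match of `outer`, list every match of `inner` inside outer's group 1."""
    return [re.findall(inner, m.group(1)) for m in re.finditer(outer, text)]

outer = r'SOME_STRING_((?:[A-D][0-9]+)+)_[A-Z]+'
inner = r'(([A-D])[0-9]+)'
# Assumed sample text containing two matches.
text = 'SOME_STRING_A2B2C3_OTK and SOME_STRING_D7_XY'

result = captures_per_match(outer, inner, text)
print(result)
# [[('A2', 'A'), ('B2', 'B'), ('C3', 'C')], [('D7', 'D')]]
```

It is still two regex passes internally, but the caller only makes one call.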

Upvotes: 2

Bill Huang

Reputation: 4648

A single-regex (and possibly single-pass-of-data) solution can be done, provided your sample code and sample data are both well-defined. The assumed premises are:

  1. The length of SOME_STRING_ is fixed. This is based on the example data you give, where SOME_STRING_ reads as a literal string and not a regex.
  2. The data contains no [E-Z] or other exceptions in its "alphabet-digits" part. This is based on your working two-line solution, which would have raised AttributeError: 'NoneType' object has no attribute 'group' if data like SOME_STRING_A1B2Z3_OTK existed. No such error was reported, so I assume you have no such data.
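The second premise can be checked directly. The sketch below uses the pattern from the question's two-step solution and shows that re.match returns None as soon as a letter outside [A-D] appears, which is what would trigger the AttributeError:

```python
import re

pattern = r'SOME_STRING_(([A-D][0-9]+)+)_[A-Z]+'

ok = re.match(pattern, 'SOME_STRING_A2B2C3_OTK')
bad = re.match(pattern, 'SOME_STRING_A1B2Z3_OTK')  # Z is outside [A-D]

print(ok.group(1))  # A2B2C3
print(bad)          # None, so bad.group(1) would raise AttributeError
```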

If both premises hold, a single regex r"[0-9]+" can be used to perform a straightforward string split. Each run of digits is discarded as a whole because the + operator is greedy, according to the official documentation. In theory the greedy match can be done in a single pass over the data, so the efficiency should be satisfactory if that is indeed the case. (I did not check the implementation details, though.)

Solution

import re
s = 'SOME_STRING_A10B20C30_OTK'  # len("SOME_STRING_") = 12 is fixed
                                 # may have multiple digits in between

re.compile(r"[0-9]+").split(s[12:])[:-1]  # discard the last element
# returns ['A', 'B', 'C']
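To see why the last element is discarded, it may help to look at the intermediate result of the split. This sketch reuses the sample string above; the variable names tail and parts are only illustrative.

```python
import re

s = 'SOME_STRING_A10B20C30_OTK'
tail = s[len('SOME_STRING_'):]   # 'A10B20C30_OTK'

# Splitting on digit runs leaves the letters plus the trailing suffix.
parts = re.split(r'[0-9]+', tail)
print(parts)       # ['A', 'B', 'C', '_OTK']
print(parts[:-1])  # ['A', 'B', 'C']
```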

Upvotes: 0
