idiot one

Reputation: 389

Python Regex Capturing Group

string1 = "abcdbcdbcde"

I want to split string1 into three parts (the first and third parts may be empty strings):

first part: a

second part (repetitions of some string): bcdbcdbcd

third part: e

import re

string1 = "abcdbcdbcde"
m = re.match("(.*)(.+){2,}(.*)", string1)
print(m.groups()[0], m.groups()[1], m.groups()[2])

Of course, the code above doesn't work.

As far as I know, parentheses can be used either as a regex capturing group or to group part of the pattern so that a quantifier applies to it. How can I use parentheses for both purposes at the same time?

What I want:

m.groups()[0] = "a"
m.groups()[1] = "bcdbcdbcd"
m.groups()[2] = "e"

Upvotes: 2

Views: 3169

Answers (4)

Andrej Kesely

Reputation: 195643

My take on the problem:

import re

def match(s, m):
    m = re.match("(.*?)?((?:" + m + "){2,})(.*?)?$", s)
    return (m.groups()[0], m.groups()[1], m.groups()[2]) if m else (None, None, None)

print(match("abcdbcdbcde", "bcd"))
print(match("bcdbcdbcd", "bcd"))
print(match("abcdbcdbcd", "bcd"))
print(match("bcdbcdbcde", "bcd"))
print(match("axxbcdbcdxxe", "bcd"))
print(match("axxbcdxxe", "bcd")) # only one bcd in the middle

Prints:

('a', 'bcdbcdbcd', 'e')
('', 'bcdbcdbcd', '')
('a', 'bcdbcdbcd', '')
('', 'bcdbcdbcd', 'e')
('axx', 'bcdbcd', 'xxe')
(None, None, None)
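
One thing the function above leaves open is that the repeated string is pasted into the pattern verbatim, so a value containing regex metacharacters (".", "+", and so on) would be interpreted as a pattern rather than as literal text. A minimal sketch of a variation, assuming literal matching is what is wanted; the name split_repeats is hypothetical and not part of the answer above:

```python
import re

def split_repeats(text, unit):
    """Split text into (prefix, unit repeated 2+ times, suffix).

    Hypothetical helper: a variation on the answer's function that
    treats `unit` as literal text instead of as a regex fragment.
    """
    # re.escape makes metacharacters in `unit` (".", "+", ...) match literally.
    pattern = "(.*?)((?:" + re.escape(unit) + "){2,})(.*)$"
    found = re.match(pattern, text)
    return found.groups() if found else (None, None, None)

print(split_repeats("abcdbcdbcde", "bcd"))  # ('a', 'bcdbcdbcd', 'e')
print(split_repeats("xa.ba.by", "a.b"))     # ('x', 'a.ba.b', 'y')
print(split_repeats("xaXbaYby", "a.b"))     # (None, None, None)
```

Without re.escape, the last call would match, because the "." in "a.b" would stand for any character.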

Upvotes: 1

Tim Pietzcker

Reputation: 336498

The following regex should work (caveat below):

^(.*?)((.+?)\3+)(.*)

Explanation:

^      # Start of string
(.*?)  # Match any number of characters, as few as possible, until...
(      # (Start capturing group #2)
 (.+?) # ... a string is matched (and captured in group #3)
 \3+   # that is repeated at least once.
)      # End of group #2
(.*)   # Match the rest of the string

Test it live on regex101.com.
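
For reference, a minimal sketch of how this pattern might be used from Python. The helper name split_parts is an assumption for illustration; group 3 only exists to feed the backreference, so it is dropped from the result:

```python
import re

# Pattern from the answer above; group 3 is only a helper for the \3 backreference.
PATTERN = re.compile(r"^(.*?)((.+?)\3+)(.*)")

def split_parts(text):
    """Return (first, repeated, third), or None if nothing repeats."""
    found = PATTERN.match(text)
    if found is None:
        return None
    first, second, _unit, third = found.groups()
    return first, second, third

print(split_parts("abcdbcdbcde"))  # ('a', 'bcdbcdbcd', 'e')
print(split_parts("abc"))          # None
```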

Caveat: If the string is long and doesn't have any obvious repeats, this is going to have poor performance characteristics (roughly cubic in the length of the string, I think), since the regex engine has to try every start position and every candidate length for the repeated substring before giving up. See catastrophic backtracking.

Upvotes: 0

Michał Turczyn

Reputation: 37525

I think it is impossible to match your requirements exactly, as more capturing groups are needed (at least one more, so that the same string can be matched again with a backreference such as \1).

But you can try (\w+)((\w+)\3+)(\w+)

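It consists of 4 capturing groups. Generally, the first capturing group holds the leading part and the last holds the trailing e, the second holds the repeated string, and the third is only a helper for the backreference. Note that the leading \w+ is greedy, so it can absorb some of the repetitions: on abcdbcdbcde the first group captures abcd and the second bcdbcd.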

Explanation:

\w+ - match one or more of word characters

\3+ - match the string captured in the third capturing group, one or more times
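
A short sketch of what this pattern captures in Python on the question's sample string. Because the leading \w+ is greedy, it takes one of the repetitions with it; the lazy variant (LAZY below) is an assumption of mine, not part of the answer, and is shown only for comparison:

```python
import re

text = "abcdbcdbcde"

# Pattern as given in the answer: the greedy first group grabs as much as it can.
GREEDY = re.compile(r"(\w+)((\w+)\3+)(\w+)")
groups = GREEDY.match(text).groups()
print(groups)  # ('abcd', 'bcdbcd', 'bcd', 'e')

# Hypothetical variation: lazy quantifiers keep the first group as short as possible.
LAZY = re.compile(r"(\w+?)((\w+?)\3+)(\w+)")
lazy_groups = LAZY.match(text).groups()
print(lazy_groups)  # ('a', 'bcdbcdbcd', 'bcd', 'e')
```

Either way, \w+ requires at least one character, so neither form allows an empty first or third part.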

Upvotes: 0

The fourth bird

Reputation: 163642

If the second part should be a repetition of the same string, you could make the first and third parts optional. For the second part you could use a capturing group and a backreference:

^.?(.+)\1+.?$

Or if you want all capturing groups:

^(.?)((.+)\3+)(.?)$
  • ^ Start of string
  • (.?) Group 1, optionally match any char
  • ( Group 2
    • (.+)\3+ Group 3, match 1+ chars, followed by a backreference to group 3 repeated 1+ times
  • ) Close group 2
  • (.?) Group 4, optionally match any char
  • $ End of string
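
A minimal sketch of both patterns in Python, run against the question's sample string. The variable names are illustrative only. The last line shows a consequence of using .? for the outer parts: each may be at most one character long, so a longer prefix or suffix yields no match:

```python
import re

text = "abcdbcdbcde"

# Short form: only the repeated unit is captured.
short = re.match(r"^.?(.+)\1+.?$", text)
print(short.group(1))  # bcd

# Full form: groups 1, 2 and 4 are the three parts; group 3 is the repeated unit.
full = re.match(r"^(.?)((.+)\3+)(.?)$", text)
parts = (full.group(1), full.group(2), full.group(4))
print(parts)  # ('a', 'bcdbcdbcd', 'e')

# A prefix or suffix longer than one character does not fit the .? slots.
no_match = re.match(r"^(.?)((.+)\3+)(.?)$", "axxbcdbcdxxe")
print(no_match)  # None
```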

Upvotes: 3
