Reputation: 389
string1 = "abcdbcdbcde"
I want to split string1 into three parts (the first and third parts may be empty strings):
first part: a
second part (repetitions of some string): bcdbcdbcd
third part: e
import re
string1 = "abcdbcdbcde"
m = re.match("(.*)(.+){2,}(.*)", string1)
print m.groups()[0], m.groups()[1], m.groups()[2]
Of course, the code above doesn't work.
As far as I know, parentheses in a regex can serve either as a capturing group or to group a sub-pattern so it can be repeated. How can I use parentheses for both purposes at the same time?
What I want:
m.groups()[0] = "a"
m.groups()[1] = "bcdbcdbcd"
m.groups()[2] = "e"
Upvotes: 2
Views: 3169
Reputation: 195643
My take on the problem:
import re
def match(s, m):
    m = re.match("(.*?)?((?:" + m + "){2,})(.*?)?$", s)
    return (m.groups()[0], m.groups()[1], m.groups()[2]) if m else (None, None, None)
print(match("abcdbcdbcde", "bcd"))
print(match("bcdbcdbcd", "bcd"))
print(match("abcdbcdbcd", "bcd"))
print(match("bcdbcdbcde", "bcd"))
print(match("axxbcdbcdxxe", "bcd"))
print(match("axxbcdxxe", "bcd")) # only one bcd in the middle
Prints:
('a', 'bcdbcdbcd', 'e')
('', 'bcdbcdbcd', '')
('a', 'bcdbcdbcd', '')
('', 'bcdbcdbcd', 'e')
('axx', 'bcdbcd', 'xxe')
(None, None, None)
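One thing to watch for: the repeated string is concatenated into the regex as-is, so a value containing regex metacharacters (such as . or +) would be interpreted as a pattern. A minimal sketch of a variant that treats it literally is below; the name match_literal and the use of re.escape are my additions for illustration, not part of the code above.

```python
import re

def match_literal(s, unit):
    """Split s into (prefix, run of repeated unit, suffix), treating unit literally."""
    # Assumption: unit is plain text, so it is escaped before being put in the pattern.
    pattern = "(.*?)((?:" + re.escape(unit) + "){2,})(.*)$"
    m = re.match(pattern, s)
    return m.groups() if m else (None, None, None)

print(match_literal("abcdbcdbcde", "bcd"))
print(match_literal("a.+.+b", ".+"))
```

Here the second call splits around the literal text ".+" repeated twice, rather than treating ".+" as "one or more of any character".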
Upvotes: 1
Reputation: 336498
The following regex should work (caveat below):
^(.*?)((.+?)\3+)(.*)
Explanation:
^ # Start of string
(.*?) # Match any number of characters, as few as possible, until...
( # (Start capturing group #2)
(.+?) # ... a string is matched (and captured in group #3)
\3+ # that is repeated at least once.
) # End of group #2
(.*) # Match the rest of the string
Test it live on regex101.com.
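Caveat: If the string is long and doesn't have any obvious repeats, this is going to have very bad performance characteristics (my first guess was O(n!), though for this particular pattern it is probably closer to cubic in the string length), since the regex engine has to try every start position and every candidate length for the repeated substring. See catastrophic backtracking.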
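For completeness, here is a small sketch of how the pattern could be used from Python, as in the question. The helper name split_repeated is made up for this example; group 3 (the repeated unit) is captured but dropped from the result.

```python
import re

# Group 1: prefix, group 2: whole repeated run, group 3: repeated unit, group 4: rest.
REPEAT_RE = re.compile(r"^(.*?)((.+?)\3+)(.*)")

def split_repeated(s):
    """Return (prefix, repeated run, suffix), or None if nothing repeats."""
    m = REPEAT_RE.match(s)
    if m is None:
        return None
    prefix, repeated, _unit, suffix = m.groups()
    return prefix, repeated, suffix

print(split_repeated("abcdbcdbcde"))
print(split_repeated("axxbcdbcdxxe"))
```

Note that the earliest, shortest repetition wins: for "axxbcdbcdxxe" the repeated part found is "xx", not "bcdbcd", because the lazy first group stops as soon as any repeat is possible.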
Upvotes: 0
Reputation: 37525
I think it is impossible to match your requirements exactly, as more capturing groups are needed (at least one more, to match the same string again with a backreference such as \1).
But you can try (\w+)((\w+)\3+)(\w+)
It consists of 4 capturing groups. Generally, the first capturing group will contain a and the last will contain e, the second will contain the repeated string, and the remaining group is irrelevant.
Explanation:
\w+ - match one or more word characters
\3+ - match the string captured in the third capturing group, one or more times
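It may be worth checking what the groups actually contain in Python, because the greedy \w+ in the first group grabs as much as it can. The sketch below runs the pattern as written, plus a lazy variant that is my own assumption (not part of the suggestion above) and that also allows empty first and last parts.

```python
import re

s = "abcdbcdbcde"

# Pattern as suggested: the first \w+ is greedy.
greedy = re.match(r"(\w+)((\w+)\3+)(\w+)", s)

# Hypothetical lazy variant: shortest possible prefix, empty edges allowed.
lazy = re.match(r"(\w*?)((\w+?)\3+)(\w*)$", s)

print(greedy.groups())
print(lazy.groups())
```

With the greedy pattern the first group comes out as 'abcd' and the second as 'bcdbcd', while the lazy variant gives 'a', 'bcdbcdbcd', 'bcd', 'e'.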
Upvotes: 0
Reputation: 163642
If the second part should be a repetition of the same string, you could make the first and third parts optional. For the second part you could use a capturing group and a backreference:
^.?(.+)\1+.?$
Or if you want all capturing groups:
^(.?)((.+)\3+)(.?)$
^ Start of string
(.?) Group 1, optionally match any char
( Group 2
(.+)\3+ Group 3, match any char 1+ times, followed by a backreference to group 3 repeated 1+ times
) Close group 2
(.?) Group 4, optionally match any char
$ End of string
Upvotes: 3
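A quick way to try the pattern ^(.?)((.+)\3+)(.?)$ in Python is sketched below. The function name split_single_char_edges is invented for this example; group 3 (the repeated unit) is discarded from the result.

```python
import re

# Group 1 and group 4 use .? so each matches at most one character.
PATTERN = re.compile(r"^(.?)((.+)\3+)(.?)$")

def split_single_char_edges(s):
    """Return (first, repeated run, last), or None if the string does not fit."""
    m = PATTERN.match(s)
    if m is None:
        return None
    first, repeated, _unit, last = m.groups()
    return first, repeated, last

print(split_single_char_edges("abcdbcdbcde"))
print(split_single_char_edges("xxabcabcyy"))
```

Because the first and last groups are limited to a single character, a string such as "xxabcabcyy" does not match at all, so this fits only when the parts around the repetition are at most one character long.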