Reputation: 819
I tried some code like below.
re.findall(r'(\d{2}){2}', 'shs111111111')
I want to get the result like
11111111
but the result is
['11', '11']
Edit:
I made a mistake in the example. What I really need is to find all the repeated substrings.
Like this:
re.findall(r'([actg]{2,}){2,}', 'aaaaaaaccccctttttttttt')
I would like the result to be ['aaaaaaa', 'ccccc', 'tttttttttt']
But I got ['aa', 'cc', 'tt']
What is the problem, and how can I get the result I want?
Upvotes: 1
Views: 548
Reputation: 626729
With re.findall you cannot obtain a plain ['aaaaaaa', 'ccccc', 'tttttttttt'], because you need a capture group to check for repetition with a back-reference, and findall returns the groups rather than the whole match. Here is a regex with a named group, letter, that holds a single letter (a, b, etc.); the (?P=letter)+ back-reference then matches every repetition of that letter:
((?P<letter>[a-zA-Z])(?P=letter)+)
To get the plain list, use this regex with finditer, as described in @anubhava's post.
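As a minimal sketch of that approach, the snippet below pairs the named-group pattern with re.finditer and reads the whole match from each match object. The helper name repeated_runs is hypothetical, and the outer group is dropped here because group(0) already gives the full match.

```python
import re

# Named group "letter" captures one character; (?P=letter)+ then requires
# that same character to repeat at least once more.
RUN = re.compile(r'(?P<letter>[a-zA-Z])(?P=letter)+')

def repeated_runs(text):
    """Return every run of two or more identical letters (hypothetical helper)."""
    # group(0) is the whole match, so no extra outer group is needed.
    return [m.group(0) for m in RUN.finditer(text)]

print(repeated_runs('aaaaaaaccccctttttttttt'))
# ['aaaaaaa', 'ccccc', 'tttttttttt']
```

Because finditer yields match objects, the capture group needed for the back-reference does not change what ends up in the result list.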
Upvotes: 0
Reputation: 784958
I believe you need this regex:
>>> import re
>>> print(re.findall(r'(?:\d{2}){2,}', 'shs111111111'))
['11111111']
EDIT: Based on edited question you can use:
>>> print(re.findall(r'(([actg\d])\2+)', 'aaaaaaaccccctttttttttt'))
[('aaaaaaa', 'a'), ('ccccc', 'c'), ('tttttttttt', 't')]
And grab captured group #1 from each pair.
Using finditer:
>>> arr = []
>>> for match in re.finditer(r'(([actg\d])\2+)', 'aaaaaaaccccctttttttttt'):
...     arr.append(match.group(1))
...
>>> print(arr)
['aaaaaaa', 'ccccc', 'tttttttttt']
Upvotes: 1
Reputation: 67968
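re.findall returns the captured groups, not the whole match, when the pattern contains groups. So use
re.findall(r'(?:\d{2}){2,}', 'shs111111111')
Just make the group non-capturing (and use {2,} rather than {2}, so that the whole run of digits is matched).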
Relevant doc excerpt:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
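The behaviour described in that excerpt can be checked with a short sketch. The variable names are illustrative only; the input string is the one from the question.

```python
import re

text = 'shs111111111'

# No capturing groups: findall returns the whole match.
no_group = re.findall(r'(?:\d{2}){2,}', text)

# One capturing group: findall returns only that group, and a repeated
# group keeps just its last iteration.
one_group = re.findall(r'(\d{2}){2,}', text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'((\d)\2+)', text)

print(no_group)    # ['11111111']
print(one_group)   # ['11']
print(two_groups)  # [('111111111', '1')]
```

The second case is what the question ran into: the match itself covers the whole run, but only the last repetition of the group is reported.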
For the edited question, use the pattern
(([acgt])\2+)
and keep the first group of each tuple:
x = "aaaaaaaccccctttttttttt"
print([i[0] for i in re.findall(r'(([acgt])\2+)', x)])
Upvotes: 1