Reputation: 819

Why does re.findall return the wrong result in Python?

I tried the code below.

re.findall(r'(\d{2}){2}', 'shs111111111')

I want to get a result like

11111111

but the result is

['11', '11']

Edit:

I made a mistake in the example. What I really need is to find all the repeated substrings.

Like this:

re.findall(r'([actg]{2,}){2,}', 'aaaaaaaccccctttttttttt')

I would like the result to be ['aaaaaaa', 'ccccc', 'tttttttttt']

But I got ['aa', 'cc', 'tt']

What is the problem, and how can I fix it?

Upvotes: 1

Answers (3)

Wiktor Stribiżew

Reputation: 626729

With re.findall alone you cannot get exactly ['aaaaaaa', 'ccccc', 'tttttttttt'], because you need a capture group to check for repetition with a back-reference, and findall returns the groups rather than the whole match. The regex below uses a named group, letter, that holds a single letter (a, or b, etc.), and the (?P=letter)+ back-reference then matches all further repetitions of that letter.

((?P<letter>[a-zA-Z])(?P=letter)+)

To get only the full matches, use this regex with finditer, as described in @anubhava's answer.
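As a minimal sketch of that idea (the variable names `pattern`, `text`, and `runs` are illustrative and not from the answer, and the outer group is dropped because `group(0)` already holds the whole match), the named-group regex can be combined with finditer like this:

```python
import re

# The named group "letter" captures one character; (?P=letter)+ then
# requires at least one more copy of that same character.
pattern = re.compile(r'(?P<letter>[a-zA-Z])(?P=letter)+')

text = 'aaaaaaaccccctttttttttt'

# group(0) is always the whole match, regardless of capture groups.
runs = [m.group(0) for m in pattern.finditer(text)]
print(runs)  # ['aaaaaaa', 'ccccc', 'tttttttttt']
```

Because finditer yields match objects instead of group tuples, the capture group needed for the back-reference no longer affects what ends up in the result list.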

Upvotes: 0

anubhava

Reputation: 784958

I believe you need this regex:

>>> import re
>>> print(re.findall(r'(?:\d{2}){2,}', 'shs111111111'))
['11111111']

EDIT: Based on the edited question, you can use:

>>> print(re.findall(r'(([actg\d])\2+)', 'aaaaaaaccccctttttttttt'))
[('aaaaaaa', 'a'), ('ccccc', 'c'), ('tttttttttt', 't')]

Then take captured group #1 from each pair.

Using finditer:

>>> arr = []
>>> for match in re.finditer(r'(([actg\d])\2+)', 'aaaaaaaccccctttttttttt'):
...     arr.append(match.group(1))  # group 1 is the whole run
...
>>> print(arr)
['aaaaaaa', 'ccccc', 'tttttttttt']

Upvotes: 1

vks

Reputation: 67968

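When the pattern contains capture groups, re.findall returns the groups instead of the whole match. So use

re.findall(r'(?:\d{2}){2,}', 'shs111111111')

Just make the group non-capturing. Note the {2,}: with {2} each match stops after four digits, so you would get ['1111', '1111'] rather than ['11111111'].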

Relevant doc excerpt:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
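The behaviour described in that excerpt can be checked with a short sketch. The input string comes from the question; the variable names and the `\d{4}` and `((\d)\2+)` patterns are only illustrative assumptions chosen to cover the zero-, one-, and two-group cases:

```python
import re

s = 'shs111111111'  # three letters followed by nine 1s

# No groups: findall returns the full text of each match.
no_group = re.findall(r'\d{4}', s)

# One capturing group: findall returns only that group's text.
# A repeated group keeps just its last repetition.
one_group = re.findall(r'(\d{2}){2}', s)

# Non-capturing group: back to full matches.
non_capturing = re.findall(r'(?:\d{2}){2}', s)

# Two capturing groups: findall returns a list of tuples.
two_groups = re.findall(r'((\d)\2+)', s)

print(no_group)       # ['1111', '1111']
print(one_group)      # ['11', '11']
print(non_capturing)  # ['1111', '1111']
print(two_groups)     # [('111111111', '1')]
```

The second case is exactly the result the question reports: the pattern matches four digits each time, but only the last two-digit capture is returned.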

For the edited question, use this pattern:

(([acgt])\2+)

and take the first element of each tuple:

x = "aaaaaaaccccctttttttttt"
print([i[0] for i in re.findall(r'(([acgt])\2+)', x)])

Upvotes: 1
