lstbl

Reputation: 537

re.finditer() returning same value for start and end methods

I'm having some trouble with the re.finditer() function in Python. For example:

>>> import re
>>> sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
>>> [[m.start(), m.end()] for m in re.finditer(r'(?=gatttaacg)', sequence)]

out: [[22,22]]

As you can see, the start() and end() methods return the same value. I've noticed this before and worked around it by using m.start() + len(query_sequence) instead of m.end(), but I don't understand why this happens.

Upvotes: 7

Views: 3366

Answers (4)

Kijewski

Reputation: 26022

If the length of the subsequence is not known a priori, you can put a capturing group inside the lookahead and take that group's span:

[m.span(1) for m in re.finditer(r'(?=(gatttaacg))',sequence)] == [(22,31)]

E.g. to find all repeated characters:

[m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))',sequence)]
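A minimal, self-contained sketch of the capture-inside-lookahead idea, using a short made-up string so the spans are easy to check by hand. The helper name lookahead_spans is only for illustration, and it assumes you do not need the pattern's own group numbers, since wrapping the pattern shifts them by one:

```python
import re

def lookahead_spans(pattern, text):
    """Return (start, end) for every match of pattern in text, overlaps included."""
    # The lookahead consumes nothing, so the overall match is empty;
    # group 1 inside it still records where the pattern matched.
    wrapped = '(?=(' + pattern + '))'
    return [m.span(1) for m in re.finditer(wrapped, text)]

print(lookahead_spans('aca', 'acaca'))  # [(0, 3), (2, 5)]
```

m.span() on the same matches would give (0, 0) and (2, 2), which is the behaviour the question describes.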

Upvotes: 1

Padraic Cunningham

Reputation: 180441

The third-party regex module supports overlapping matches in finditer through its overlapped flag:

import regex
sequence = 'acaca'
print([[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=True)])
[[0, 3], [2, 5]]

Upvotes: 6

Alyssa Haroldsen

Reputation: 3731

As specified, you are required to find overlapping matches and need the lookahead. However, you appear to know the exact string you're looking for. How about this?

import re

def find_overlapping(sequence, matchstr):
    # re.escape keeps any regex metacharacters in matchstr literal
    for m in re.finditer('(?={})'.format(re.escape(matchstr)), sequence):
        yield (m.start(), m.start() + len(matchstr))

Alternatively, you could use the third-party Python regex module, as described here.
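Since the query here is a literal string, the same spans can also be produced without regular expressions at all. This is a different technique from the lookahead above (a plain str.find loop), sketched with a hypothetical helper name:

```python
def find_overlapping_literal(sequence, query):
    """Yield (start, end) for every occurrence of query, overlaps included."""
    start = sequence.find(query)
    while start != -1:
        yield (start, start + len(query))
        # Resume one character later so overlapping occurrences are found.
        start = sequence.find(query, start + 1)

print(list(find_overlapping_literal('acaca', 'aca')))  # [(0, 3), (2, 5)]
```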

Upvotes: 1

vks

Reputation: 67968

import re
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.end()] for m in re.finditer(r'(gatttaacg)', sequence)])

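Remove the lookahead. A lookahead does not consume any characters; it only asserts that the pattern follows, so the match it produces is zero-width and start() equals end().

Output: [[22, 31]]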
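To make the difference concrete, here is a small sketch on a made-up string where the query overlaps itself. It only uses the standard re module; the variable names are illustrative:

```python
import re

text = 'acaca'  # made-up example with two overlapping occurrences of 'aca'

# A plain pattern consumes the characters it matches, so the second,
# overlapping occurrence starting at index 2 is never reported.
consuming = [m.span() for m in re.finditer(r'aca', text)]

# A lookahead only asserts that 'aca' follows; the match itself is empty,
# which is why start() and end() are equal.
asserting = [m.span() for m in re.finditer(r'(?=aca)', text)]

print(consuming)  # [(0, 3)]
print(asserting)  # [(0, 0), (2, 2)]
```

So dropping the lookahead gives a usable end(), at the cost of missing overlapping matches.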

If you have to use a lookahead (for example, to find overlapping matches), compute the end from the length of the query:

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.start() + len("aca")] for m in re.finditer(r'(?=aca)', sequence)])

Upvotes: 2
