Reputation: 537
I'm having some trouble with the re.finditer() function in Python. For example:
>>>sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
>>>[[m.start(),m.end()] for m in re.finditer(r'(?=gatttaacg)',sequence)]
out: [[22,22]]
As you can see, the start() and end() methods are giving the same value. I've noticed this before and just ended up using m.start()+len(query_sequence) instead of m.end(), but I am very confused why this is happening.
Upvotes: 7
Views: 3366
Reputation: 26022
If the length of the subsequence is not known a priori, then you can use a capturing group inside the lookahead and take its span:
[m.span(1) for m in re.finditer(r'(?=(gatttaacg))',sequence)] == [(22,31)]
E.g. to find all repeated characters:
[m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))',sequence)]
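To make the behaviour of the repeated-character pattern concrete, here is a small sketch on a short made-up sequence (the value 'aaacgg' and the name `repeats` are illustrative assumptions, not taken from the question):

```python
import re

# Short illustrative sequence (an assumption, not the one from the question).
sequence = 'aaacgg'

# The lookahead itself is zero-width; group 1 inside it records the text
# that would have matched, so span(1) gives real start/end positions.
repeats = [m.span(1) for m in re.finditer(r'(?=(([acgt])\2+))', sequence)]
print(repeats)  # [(0, 3), (1, 3), (4, 6)]
```

Note that because the search restarts at every position, the shorter run starting at index 1 is reported as well as the full run starting at index 0.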
Upvotes: 1
Reputation: 180441
The third-party regex module supports overlapping matches with finditer:
import regex
sequence = 'acaca'
print([[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=True)])
[[0, 3], [2, 5]]
Upvotes: 6
Reputation: 3731
As specified, you are required to find overlapping matches and need the lookahead. However, you appear to know the exact string you're looking for. How about this?
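import re
def find_overlapping(sequence, matchstr):
    # Escape matchstr so it is treated as a literal, and pass the sequence to search.
    for m in re.finditer('(?={})'.format(re.escape(matchstr)), sequence):
        yield (m.start(), m.start() + len(matchstr))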
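A self-contained sketch of the same idea with a usage example follows. The helper name `overlapping_spans` and the test string 'acaca' are hypothetical choices for illustration, and the sketch assumes the search term is a literal string rather than a regex:

```python
import re

def overlapping_spans(sequence, literal):
    """Yield (start, end) for every, possibly overlapping, occurrence of literal."""
    # re.escape makes sure the literal is matched verbatim inside the lookahead.
    pattern = '(?={})'.format(re.escape(literal))
    for m in re.finditer(pattern, sequence):
        # The lookahead is zero-width, so the end is computed from the length.
        yield (m.start(), m.start() + len(literal))

spans = list(overlapping_spans('acaca', 'aca'))
print(spans)  # [(0, 3), (2, 5)]
```

Computing the end from len(literal) only works for fixed strings; for a variable-length pattern, a capturing group inside the lookahead is the safer route.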
Alternatively, you could use the third-party Python regex module, as described here.
Upvotes: 1
Reputation: 67968
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.end()] for m in re.finditer(r'(gatttaacg)', sequence)])
Remove the lookahead. It does not consume any characters; it only asserts that the pattern follows, so the match itself is empty.
Output: [[22, 31]]
If you have to use a lookahead, use:
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.start() + len("aca")] for m in re.finditer(r'(?=aca)', sequence)])
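To see why start() and end() coincide for a lookahead, it may help to compare a consuming pattern with the lookahead version on a short string. This is only a sketch; the string 'acaca' and the names `consuming` and `lookahead` are chosen for illustration:

```python
import re

sequence = 'acaca'

# A consuming pattern advances past each match, so overlapping hits are skipped.
consuming = [m.span() for m in re.finditer(r'aca', sequence)]

# A lookahead matches the empty string at each position where 'aca' follows,
# so start() == end(), but every overlapping position is reported.
lookahead = [m.span() for m in re.finditer(r'(?=aca)', sequence)]

print(consuming)  # [(0, 3)]
print(lookahead)  # [(0, 0), (2, 2)]
```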
Upvotes: 2