Reputation: 51
I want to find all four-letter strings in a sequence. The first letter is 'N', the second one is not 'P', the third one is 'S' or 'T' and the last one is not 'P'.
Here's my code:
import re
seq='NNSTQ'
glyco=re.findall('N[^P][S|T][^P]',seq)
print glyco
and the result is:
['NNST']
However, the expected output should be:
['NNST','NSTQ']
I think the problem is that these two strings overlap, so re.findall() skips the second one. How can I solve this?
Upvotes: 2
Views: 65
Reputation: 682
findall() does not return overlapping matches, but nothing stops you from searching for them explicitly, for example with:
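import re

def myfindall(p, s):
    found = []
    i = 0  # offset in s where the next search starts
    while True:
        r = re.search(p, s[i:])
        if r is None:
            break
        found.append(r.group())
        # r.start() is relative to the slice s[i:]; resume one
        # character past the start of this match
        i += r.start()+1
    return found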
seq='NNSTQ'
glyco=myfindall('N[^P][ST][^P]', seq)
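A variant of the same idea, as a sketch only: a compiled pattern's search() method accepts a start position, so the string does not have to be sliced on every iteration. The function name find_overlapping is made up for illustration.

```python
import re

def find_overlapping(pattern, s):
    """Return all matches of pattern in s, including overlapping ones."""
    regex = re.compile(pattern)
    found = []
    pos = 0
    while True:
        # With a pos argument, m.start() is an index into the full string
        m = regex.search(s, pos)
        if m is None:
            break
        found.append(m.group())
        # Resume one character past the start of this match
        pos = m.start() + 1
    return found

print(find_overlapping('N[^P][ST][^P]', 'NNSTQ'))  # ['NNST', 'NSTQ']
```

One difference from slicing: with a start position, an anchor such as ^ still refers to the real beginning of the string rather than the beginning of each slice.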
Upvotes: 0
Reputation: 33724
You should use a (?=...) lookahead assertion instead, since findall consumes each part of the string only once, which means it ignores overlapping matches:
import re
seq='NNSTQ'
# [ST] rather than [S|T]: inside a character class, | is a literal pipe
glyco=re.findall('(?=(N[^P][ST][^P]))',seq)
print (glyco)
# prints ['NNST', 'NSTQ']
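This finds every match, even overlapping ones: the lookahead itself consumes no characters, so the search resumes at the next position, and the capturing group inside it is what findall returns. As the documentation states: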
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
You can also check this for more info:
http://regular-expressions.mobi/lookaround.html?wlr=1
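If the position of each match is also needed, the same lookahead pattern can be used with re.finditer(). This is a minimal sketch; the variable names are only illustrative.

```python
import re

seq = 'NNSTQ'
# The lookahead matches an empty string at each start position;
# group 1 holds the four letters that follow that position.
pattern = re.compile(r'(?=(N[^P][ST][^P]))')
hits = [(m.start(), m.group(1)) for m in pattern.finditer(seq)]
print(hits)  # [(0, 'NNST'), (1, 'NSTQ')]
```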
Upvotes: 3