Hang Lin

Reputation: 51

How to find overlapping matches with a regex in Python?

I want to find all four-letter strings in a sequence. The first letter is 'N', the second one is not 'P', the third one is 'S' or 'T' and the last one is not 'P'.

Here's my code:

import re
seq='NNSTQ'
glyco=re.findall('N[^P][S|T][^P]',seq)
print glyco

and the result is:

['NNST']

However, the expected output should be:

['NNST','NSTQ']

I think the problem is that these two strings overlap, so re.findall() skips the second one. How can I get both matches?

Upvotes: 2

Views: 65

Answers (2)

MassPikeMike

Reputation: 682

findall() does not return overlapping matches, but there is nothing to stop you from explicitly searching for them, for example with:

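import re

def myfindall(p, s):
    """Return all matches of p in s, including overlapping ones."""
    found = []
    i = 0
    while True:
        # Search the remainder of the string starting at offset i.
        r = re.search(p, s[i:])
        if r is None:
            break
        found.append(r.group())
        # r.start() is relative to s[i:], so resume one character past the match start.
        i += r.start() + 1
    return found

seq = 'NNSTQ'
glyco = myfindall('N[^P][ST][^P]', seq)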
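
The same explicit search can be written without slicing the string on every iteration. The following is a minimal sketch, not part of the original answer: it assumes a compiled pattern and uses the optional pos argument of Pattern.search() to resume one character past the previous match start. The name find_overlapping is hypothetical.

```python
import re

def find_overlapping(pattern, s):
    """Collect every match of pattern in s, overlapping ones included."""
    regex = re.compile(pattern)
    found = []
    pos = 0
    while True:
        # Search from absolute offset pos instead of slicing s.
        m = regex.search(s, pos)
        if m is None:
            break
        found.append(m.group())
        # m.start() is an absolute index here, so assign rather than add.
        pos = m.start() + 1
    return found

print(find_overlapping('N[^P][ST][^P]', 'NNSTQ'))
# prints ['NNST', 'NSTQ']
```

One difference to keep in mind: with pos, an anchor such as ^ still refers to the real start of the string, whereas slicing makes every slice look like a new string.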

Upvotes: 0

Taku

Reputation: 33724

Use a lookahead assertion, (?=...), with a capturing group inside it. findall consumes each part of the string only once, so it never returns overlapping matches; a lookahead matches without consuming anything:

import re
seq = 'NNSTQ'
# [ST] rather than [S|T]: inside a character class, | is a literal pipe, not alternation.
glyco = re.findall('(?=(N[^P][ST][^P]))', seq)
print(glyco)
# prints ['NNST', 'NSTQ']

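Because the lookahead consumes no characters, the regex engine tries the pattern at every position, so overlapping matches are all found; findall then returns the contents of the capturing group. As the documentation states: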

(?=...)

Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

You can also check this for more info:

http://regular-expressions.mobi/lookaround.html?wlr=1
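
If the position of each motif is needed as well, the same lookahead works with re.finditer. This is a small sketch under the assumption that a list of (start index, matched text) pairs is the desired output; the variable name hits is illustrative only.

```python
import re

seq = 'NNSTQ'
# The lookahead itself matches an empty string; group 1 holds the four letters.
pattern = re.compile(r'(?=(N[^P][ST][^P]))')

# m.start() is the index where the motif begins; m.group(1) is the motif text.
hits = [(m.start(), m.group(1)) for m in pattern.finditer(seq)]
print(hits)
# prints [(0, 'NNST'), (1, 'NSTQ')]
```

The indices are 0-based, so add 1 if residue numbering should start at 1.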

Upvotes: 3
