Reputation: 814
I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.
an re.findall of
"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
should give me:
['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
I have tried all of the following without success:
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)
What am I missing here?
Upvotes: 2
Views: 1255
Reputation: 70602
I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.
If you have heavy need for this, trying a match on O(N^2)
pairs of indices may soon become unbearably slow. One improvement is to use the .search()
method to "leap" directly to the only starting indices that can possibly pay off. So the following does that.
It also uses the .fullmatch()
method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $
to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch()
was added in Python 3.4, so this code also requires Python 3!
Finally, this intends to generalize the re
module's finditer()
function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.
So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:
def finditer_overlap(regexp, string):
start = 0
n = len(string)
while start <= n:
# don't know whether regexp will find shortest or
# longest match, but _will_ find leftmost match
m = regexp.search(string, start)
if m is None:
return
start = m.start()
for finish in range(start, n+1):
m = regexp.fullmatch(string, start, finish)
if m is not None:
yield m
start += 1
Then, e.g.,
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
print(match.group())
prints what you wanted in your example. The other ways you tried to write a regexp should also work. In this example it's faster because the second time around the outer loop start
is 1, and regexp.search(string, 1)
fails to find another match, so the generator exits at once (so skips checking O(N^2)
other index pairs).
Upvotes: 0
Reputation: 27283
This is not super-easy, because a) you want overlapping matches, and b) you want greedy and non-greedy and everything inbetween.
As long as the strings are fairly short, you can check every substring:
import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')
for start in range(len(s)-6): # string is at least 6 letters long
for end in range(start+6, len(s)):
if p.match(s, pos=start, endpos=end):
print(s[start:end])
This prints:
ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
Upvotes: 3