Oliver

Reputation: 335

Python regex: finding a match that starts inside a previous match

I'm looking to find the index of every occurrence of a substring in a string in Python. My current regex code can't find a match that has its start inside a previous match.

I have a string: s = r'GATATATGCATATACTT' and a substring t = r'ATAT'. There should be matches at index 1, 3, and 9. The following code only shows matches at index 1 and 9, because index 3 falls within the first match. How do I get all matches to appear?

Thanks so much!

import re

s = 'GATATATGCATATACTT'
t = r'ATAT'

pattern = re.compile(t)

for m in pattern.finditer(s):
    print(m)

Upvotes: -1

Views: 214

Answers (2)

dawg

Reputation: 104111

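If you are looking for the locations, your best bet is re.finditer with a lookahead. A lookahead such as (?=ATAT) is zero-width: it consumes no characters, so the next search can start inside the text the previous one matched: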

>>> import re
>>> s = r'GATATATGCATATACTT'
>>> t = r'ATAT'
>>> list(re.finditer(rf'(?={t})', s))
[<re.Match object; span=(1, 1), match=''>, <re.Match object; span=(3, 3), match=''>, <re.Match object; span=(9, 9), match=''>]

With the returned match object, you can get the start index:

>>> [m.start() for m in re.finditer(rf'(?={t})', s)]
[1, 3, 9]
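
One caveat with interpolating t straight into the pattern: if the substring contains regex metacharacters (+, ., * and so on), they are interpreted as regex syntax. A minimal sketch of a wrapper that matches the substring literally by passing it through re.escape first; the function name overlapping_starts is just an illustration, not something from the question:

```python
import re

def overlapping_starts(text, sub):
    """Return the start index of every occurrence of sub, overlaps included."""
    # re.escape makes any metacharacters in sub match literally.
    pattern = re.compile(f"(?={re.escape(sub)})")
    return [m.start() for m in pattern.finditer(text)]

print(overlapping_starts("GATATATGCATATACTT", "ATAT"))  # [1, 3, 9]
print(overlapping_starts("1+1+1", "1+1"))               # [0, 2]
```

Without re.escape, the second call would treat 1+1 as "one or more 1s followed by a 1" and find nothing in that string.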

You can also find overlapping substrings in pure Python:

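def find_overlaps(s, sub):
    """Yield the start index of every occurrence of sub in s, overlaps included."""
    start = 0
    while True:
        start = s.find(sub, start)
        if start == -1:
            return
        yield start
        # Advance by one character, not len(sub), so overlapping matches are found.
        start += 1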

>>> list(find_overlaps(s, t))
[1, 3, 9]
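
The size of the step after each hit is what decides whether overlaps are reported. As a rough sketch of that design choice, here is a variant with a hypothetical overlap flag (the name find_all and the flag are my own additions): stepping by 1 finds overlapping matches, while stepping by len(sub) reproduces the non-overlapping behavior from the question.

```python
def find_all(s, sub, overlap=True):
    """Yield start indices of sub in s, with or without overlapping matches."""
    if not sub:
        return  # assumption: treat an empty substring as "no matches"
    # Step 1 allows the next match to start inside the previous one;
    # step len(sub) skips past the previous match entirely.
    step = 1 if overlap else len(sub)
    start = s.find(sub)
    while start != -1:
        yield start
        start = s.find(sub, start + step)

s = "GATATATGCATATACTT"
print(list(find_all(s, "ATAT")))                 # [1, 3, 9]
print(list(find_all(s, "ATAT", overlap=False)))  # [1, 9]
```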

Upvotes: 0

Synthaze

Reputation: 6090

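Since you have overlapping matches, you need to put the expression inside a lookahead. Wrapping it in a capturing group as well, as in (?=(YOUR_EXPR)), lets you read the matched text from group 1, since the match itself is empty: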

import re

s = 'GATATATGCATATACTT'
t = r'(?=(ATAT))'

pattern = re.compile(t)

for m in pattern.finditer(s):
    print(m)

Output:

<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(9, 9), match=''>

Or:

for m in pattern.finditer(s):
    print(m.start())

Output:

1
3
9

Or:

import re

s = 'GATATATGCATATACTT'
t = 'ATAT'

pattern = re.compile(f'(?=({t}))')

# group(1) holds the text captured inside the lookahead
print([(i.start(), i.group(1)) for i in pattern.finditer(s)])

Output:

[(1, 'ATAT'), (3, 'ATAT'), (9, 'ATAT')]
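
The capturing group earns its keep when the expression can match text of different lengths, because slicing by a fixed len(t) no longer applies. A small sketch under that assumption; the helper name overlapping_matches and the test string are made up for illustration:

```python
import re

def overlapping_matches(text, expr):
    """Return (start, matched_text) for every overlapping match of expr."""
    # The outer group opens first, so it is always group 1,
    # even if expr contains capturing groups of its own.
    pattern = re.compile(f"(?=({expr}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]

# AT+A matches an A, one or more Ts, then an A; the two hits share the A at index 4.
print(overlapping_matches("GATTATA", "AT+A"))  # [(1, 'ATTA'), (4, 'ATA')]
```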

Upvotes: 0
