Reputation: 335
I'm looking to find the indices of all occurrences of a substring in a string in Python. My current regex code can't find a match that has its start in a previous match.
I have a string: s = r'GATATATGCATATACTT'
and a substring t = r'ATAT'. There should be matches at index 1, 3, and 9. The following code only shows matches at index 1 and 9 because index 3 is within the first match. How do I get all matches to appear?
Thanks so much!
import re
s= 'GATATATGCATATACTT'
t = r'ATAT'
pattern = re.compile(t)
[print(i) for i in pattern.finditer(s)]
Upvotes: -1
Views: 214
Reputation: 104111
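If you are looking for the location, your best bet is re.finditer with a zero-width lookahead. The lookahead matches at a position without consuming characters, so a later match can start inside an earlier one: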
import re
s = r'GATATATGCATATACTT'
t = r'ATAT'
>>> [m for m in re.finditer(rf'(?={t})', s)]
[<re.Match object; span=(1, 1), match=''>, <re.Match object; span=(3, 3), match=''>, <re.Match object; span=(9, 9), match=''>]
With the returned match object, you can get the start index:
>>> [m.start() for m in re.finditer(rf'(?={t})', s)]
[1, 3, 9]
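One caveat when interpolating t into the pattern: if t can contain regex metacharacters such as . or +, it is probably safer to pass it through re.escape first so that it is matched literally. A minimal sketch, where the helper name overlapping_starts is hypothetical and not part of the answer above:

```python
import re

def overlapping_starts(s, sub):
    """Return start indices of all (possibly overlapping) literal occurrences of sub in s."""
    # re.escape makes any metacharacters in sub match literally;
    # the lookahead (?=...) is zero-width, so matches may overlap.
    pattern = re.compile(f'(?={re.escape(sub)})')
    return [m.start() for m in pattern.finditer(s)]

print(overlapping_starts('GATATATGCATATACTT', 'ATAT'))  # [1, 3, 9]
print(overlapping_starts('a.axa', 'a.a'))  # [0]
```

Without re.escape, the second call would also report index 2, because an unescaped . matches any character.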
You can also find overlapping substrings in pure Python:
def find_overlaps(s, sub):
    start = 0
    while True:
        start = s.find(sub, start)
        if start == -1:
            return
        yield start
        start += 1  # advance one character so overlapping matches are found
>>> list(find_overlaps(s,t))
[1, 3, 9]
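For short inputs, the same result can be sketched as a single comprehension using str.startswith with a start offset. This is an alternative formulation rather than the approach given above, and the name find_overlaps_startswith is hypothetical:

```python
def find_overlaps_startswith(s, sub):
    """Return every index i at which s[i:] begins with sub."""
    # Test each possible start position; startswith(sub, i) compares in place
    # without building a slice. The range is empty if sub is longer than s.
    return [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]

print(find_overlaps_startswith('GATATATGCATATACTT', 'ATAT'))  # [1, 3, 9]
```

This checks every position explicitly, so for long strings the s.find loop is likely the better choice.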
Upvotes: 0
Reputation: 6090
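Since the matches overlap, wrap the pattern in a zero-width lookahead so the regex engine does not consume the matched characters. Putting a capturing group inside it, as in (?=(YOUR_EXPR)), also lets you retrieve the matched text with group(1):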
import re
s = 'GATATATGCATATACTT'
t = r'(?=(ATAT))'
pattern = re.compile(t)
for m in pattern.finditer(s):
    print(m)
Output:
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(9, 9), match=''>
Or:
for m in pattern.finditer(s):
    print(m.start())
Output:
1
3
9
Or:
import re
s= 'GATATATGCATATACTT'
t = 'ATAT'
pattern = re.compile(f'(?=({t}))')
print([(m.start(), m.group(1)) for m in pattern.finditer(s)])
Output:
[(1, 'ATAT'), (3, 'ATAT'), (9, 'ATAT')]
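When the pattern is a real regex rather than a literal, the matched text can differ in length from the pattern string, so reading it from the capturing group with group(1) is the more robust route. A minimal sketch, using a hypothetical pattern ATA[TC] that is not from the question:

```python
import re

s = 'GATATATGCATATACTT'
# Hypothetical pattern: ATA followed by either T or C.
pattern = re.compile(r'(?=(ATA[TC]))')

# group(1) is the text captured inside the lookahead;
# group(0) is always empty because the lookahead is zero-width.
hits = [(m.start(), m.group(1)) for m in pattern.finditer(s)]
print(hits)  # [(1, 'ATAT'), (3, 'ATAT'), (9, 'ATAT'), (11, 'ATAC')]
```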
Upvotes: 0