Reputation: 11
i want locations of the substring given by the recognition sequence in the genome. wrote a function which pass through the sequence and store the indices where we match the recognition sequence but its returning empty conatiner. the sequence looks like
ATCGGCGCGCGCGCGTATATATATATATATAGGCGCGCGCGCGCGTATATATATATATAGCGGCGCGCGCGCG
def restriction_sites(seq, recog_seq):
"""Find the indices of all restriction sites in a sequence."""
#here variable "seq" contains fasta sequence and "recog_seq" contains recognition sequence
# Initialize list of restriction sites
sites = []
# Check every substring for a match
for i in range(len(seq) - len(recog_seq)):
if seq[i:i+len(recog_seq)] == recog_seq:
sites.append(i)
return sites
when i run it like this:
print('HindIII:', restriction_sites(seq, 'AAGCTT'))
it shows me empty:
[]
i want output like this
HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
Upvotes: 1
Views: 125
Reputation: 47
A simple way would be this
s = "ATCGGCGCGCGCGCGTATATATATATATATAGGCGCGCGCGCGCGTATATATATATATAGCGGCGCGCGCGCG"
r = "TATA"
HindIII = []
for i in range(len(s)):
if s[i:i+4] == r:
HindIII.append(i)
print(HindIII)
Upvotes: 1
Reputation: 760
Your original function works. It returned you a blank list because your example seq has not 'AAGCTT'
. For instance, it returned [15, 17, 19, 21, 23, 25, 27, 45, 47, 49, 51, 53, 55]
to me for "TATA"
.
I could recommend use re
for matching sites. That looks more explicit:
import re
from typing import *
def restriction_sites(seq: str, recog_seq: str) -> List[int]:
return [match.start() for match in re.finditer(fr"(?={recog_seq})", seq)]
Upvotes: 2