Vicky
Vicky

Reputation: 11

how to find locations of the substring given by the recognition sequence in the genome

i want locations of the substring given by the recognition sequence in the genome. wrote a function which pass through the sequence and store the indices where we match the recognition sequence but its returning empty conatiner. the sequence looks like

ATCGGCGCGCGCGCGTATATATATATATATAGGCGCGCGCGCGCGTATATATATATATAGCGGCGCGCGCGCG
def restriction_sites(seq, recog_seq):
    """Find the indices of all restriction sites in a sequence."""
    #here variable "seq" contains fasta sequence and "recog_seq" contains recognition sequence
    # Initialize list of restriction sites
    sites = []

    # Check every substring for a match
    for i in range(len(seq) - len(recog_seq)):
        if seq[i:i+len(recog_seq)] == recog_seq:
            sites.append(i)

    return sites

when i run it like this:

print('HindIII:', restriction_sites(seq, 'AAGCTT'))

it shows me empty:

[]

i want output like this

HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]

Upvotes: 1

Views: 125

Answers (2)

ptalebic
ptalebic

Reputation: 47

A simple way would be this

s = "ATCGGCGCGCGCGCGTATATATATATATATAGGCGCGCGCGCGCGTATATATATATATAGCGGCGCGCGCGCG"
r = "TATA"
HindIII = []
for i in range(len(s)):
    if s[i:i+4] == r:
        HindIII.append(i)
print(HindIII)

Upvotes: 1

Vovin
Vovin

Reputation: 760

Your original function works. It returned you a blank list because your example seq has not 'AAGCTT'. For instance, it returned [15, 17, 19, 21, 23, 25, 27, 45, 47, 49, 51, 53, 55] to me for "TATA".

I could recommend use re for matching sites. That looks more explicit:

import re
from typing import *

def restriction_sites(seq: str, recog_seq: str) -> List[int]:
    return [match.start() for match in re.finditer(fr"(?={recog_seq})", seq)]

Upvotes: 2

Related Questions