Slicing DNA sequence using chunks

Question

I want to write a Python program that reads a DNA sequence from a text file. The sequence has more than 9000 characters. I have to cut it into groups of 3 characters, called codons, so that the reading frame covers positions 1 to 3, then 4 to 6, then 7 to 9, and so on.

For example, the sequence is

ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGA

then I have to cut it into 3-character chunks, which I have already done. My question is: how can I extract the gene sequence from the given DNA? A gene sequence starts with ATG and ends with TAG, TAA, or TGA.

This would be easy with a regular expression. The problem is that in the sequence above the ATG occupies positions 30 to 32, while our frame reads positions 1 to 3, then 4 to 6, and so on. When the frame reads positions 28 to 30 and then 31 to 33, the ATG is split across two codons, so it never appears as a codon.

Can anyone help me with this? Here is my code:

import numpy as np
import pandas as pd
import re
from pathlib import Path
dna = Path('C:/Users/abdul/Downloads/Compressed/MAJU/HCV-PK1-sequence - edited.txt').read_text()
l = [c for c in dna if c != '\n']
r = len(l)
for x in range(0,r,3):
    y=x+3
    codon = l[x:y]
    a = ''.join(codon)
    print(a)
if(a == re.findall('ATG(...)+?(TAG|TAA|TGA)', dna)):
    print("Yes")

T Burgis · Accepted Answer

Loop over the 3 reading frames like so:

dna = ''.join(l)  # l is the newline-free list of characters from the question
for frame in [0,1,2]:
    codons = [dna[x:x+3] for x in range(frame,len(dna)-2,3)]
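
To make the three frames concrete, here is a small sketch run on the example sequence from the question. The helper name codons_in_frame is made up for illustration, and the sequence is assumed to be a plain uppercase string with no newlines.

```python
# Example sequence from the question (assumed: plain string, no newlines).
dna = "ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGA"


def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


# Check each reading frame for a start codon.
for frame in range(3):
    codons = codons_in_frame(dna, frame)
    print(frame, "ATG" in codons, codons[:4])

# The frame of any position is simply its 0-based index modulo 3.
atg_index = dna.find("ATG")
atg_frame = atg_index % 3
print(atg_index, atg_frame)
```

Here the ATG sits at 0-based index 29 (positions 30 to 32 when counting from 1), so it only shows up as a codon in frame 2. That is why reading only from position 1 misses it.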

However, the better answer is to install Biopython and use its sequence-manipulation functions. It will also help you read your sequence from the file.

A solution that doesn't use biopython:

def find_orf(seq,start):
    for pos in range(start+3,len(seq)-2,3):
        codon = seq[pos:pos+3]
        if codon in ['TAA','TAG','TGA']:
            return seq[start:pos+3]
    return seq[start:] # if we don't find inframe stop codon return whole sequence from start codon to end


# Assuming seq is a string, not a list of characters:
seq = 'ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCAGCCTAATTAATAAGGTAAC'
orfs = []
for frame in [0,1,2]:
    for pos in range(frame,len(seq)-2,3):
        codon = seq[pos:pos+3]
        if codon == 'ATG':
            orf = find_orf(seq,pos)
            orfs.append(orf)

print(orfs)
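
Since the question mentions regular expressions, the same search can also be sketched with a lookahead, which lets the pattern be tried at every position rather than only at multiples of 3. This is an alternative sketch, not part of the code above: the name find_orfs_regex is made up, and it assumes the sequence contains only uppercase A, C, G and T. Unlike find_orf, it skips any ATG that has no in-frame stop codon instead of returning the rest of the sequence.

```python
import re

# Zero-width lookahead so every position is tested, whatever its frame.
# The lazy codon group stops at the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def find_orfs_regex(seq):
    """Return (start, frame, orf) for each ATG that has an in-frame stop codon."""
    return [
        (m.start(), m.start() % 3, m.group(1))
        for m in ORF_PATTERN.finditer(seq)
    ]


seq = 'ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCAGCCTAATTAATAAGGTAAC'
for start, frame, orf in find_orfs_regex(seq):
    print(start, frame, orf)
```

For this sequence it reports a single ORF starting at 0-based index 29 in frame 2, the same string that the loop above collects in orfs.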
