Reputation: 11
I want to write a Python program that reads a DNA sequence from a text file. The sequence has more than 9000 characters. I have to cut it into groups of 3 characters, so the frame reads positions 1 to 3, then 4 to 6, then 7 to 9, and so on. These groups are called codons.
For example, suppose the sequence is:
ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGA
I have to cut it into groups of 3 characters, which I have already done. My question is: how can I extract the gene sequence from the given DNA? A gene sequence starts with ATG and ends with TAG, TAA, or TGA.
This would be easy with a regular expression. The problem is that in the sequence above, ATG occurs at positions 30 to 32, while our frame reads positions 1 to 3, then 4 to 6, and so on. When the frame reaches positions 28 to 30, it doesn't form ATG.
Can anyone help me with this problem? Here is my code:
import numpy as np
import pandas as pd
import re
from pathlib import Path
dna = Path('C:/Users/abdul/Downloads/Compressed/MAJU/HCV-PK1-sequence - edited.txt').read_text()
l = [c for c in dna if c!='\n']
r = len(l)
for x in range(0,r,3):
    y=x+3
    codon = l[x:y]
    a = ''.join(codon)
    print(a)
    if(a == re.findall('ATG(...)+?(TAG|TAA|TGA)', dna)):
        print("Yes")
Upvotes: 1
Views: 1095
Reputation: 1435
Loop over the 3 reading frames like so:
dna = ''.join(dna)
for frame in [0,1,2]:
    codons = [dna[x:x+3] for x in range(frame,len(dna)-2,3)]
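As a quick check of the loop above, here is a minimal sketch that splits the example sequence from the question into codons for each of the three frames and reports which frames contain ATG as a whole codon. The helper name codons_in_frame is made up for this illustration and is not part of any library.

```python
# Example sequence copied from the question.
dna = "ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGA"

def codons_in_frame(seq, frame):
    """Split seq into complete 3-letter codons, starting at offset frame (0, 1 or 2)."""
    return [seq[x:x+3] for x in range(frame, len(seq) - 2, 3)]

# Collect every frame in which ATG shows up as a complete codon.
frames_with_start = [frame for frame in (0, 1, 2)
                     if "ATG" in codons_in_frame(dna, frame)]
print(frames_with_start)
```

For this sequence only the frame with offset 2 contains ATG, which is why reading strictly from position 1 never sees it.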
But the correct answer is to install Biopython and use its sequence manipulation functions. It will also help you read your sequence from a file.
A solution that doesn't use biopython:
def find_orf(seq,start):
    for pos in range(start+3,len(seq)-2,3):
        codon = seq[pos:pos+3]
        if codon in ['TAA','TAG','TGA']:
            return seq[start:pos+3]
    return seq[start:] # if we don't find inframe stop codon return whole sequence from start codon to end

# Assuming seq is a string, not a list of characters:
seq = 'ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCAGCCTAATTAATAAGGTAAC'
orfs = []
for frame in [0,1,2]:
    for pos in range(frame,len(seq)-2,3):
        codon = seq[pos:pos+3]
        if codon == 'ATG':
            orf = find_orf(seq,pos)
            orfs.append(orf)
print(orfs)
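Since the question mentions regular expressions, the same search can also be sketched with the re module. This is one possible variant, not the method used above: wrapping the pattern in a lookahead lets the regex engine try every start position, so all three frames are covered, and the lazy codon group stops at the first in-frame stop codon. The names ORF_PATTERN and find_orfs_regex are hypothetical. One difference in behaviour: an ATG with no in-frame stop codon is skipped here, whereas find_orf returns the rest of the sequence.

```python
import re

# (?=...) is a zero-width lookahead, so a match is attempted at every position
# and overlapping ORFs are reported. Inside it: ATG, then as few whole codons
# as possible, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")

def find_orfs_regex(seq):
    """Return every ATG...stop stretch in seq, in any of the three frames."""
    return [m.group(1) for m in ORF_PATTERN.finditer(seq)]

seq = 'ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCAGCCTAATTAATAAGGTAAC'
regex_orfs = find_orfs_regex(seq)
print(regex_orfs)
```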
Upvotes: 1
Reputation: 48367
Then just change the frame range so that it reads positions 1 to 3, 2 to 4, and so on.
You can do this by combining slicing with the range function.
dna = "ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGA"
sequence_length = 3
lst = [dna[i:i+sequence_length] for i in range(0, len(dna) - sequence_length + 1, 1)]
Output
=> ['ACC', 'CCT', 'CTG', 'TGC', 'GCC', 'CCT', 'CTC', 'TCT', 'CTT', 'TTA', 'TAC', 'ACG', 'CGA', 'GAG', 'AGG', 'GGC', 'GCG', 'CGA', 'GAC', 'ACA', 'CAC', 'ACT', 'CTC', 'TCC', 'CCA', 'CAC', 'ACC', 'CCA', 'CAT', 'ATG', 'TGG', 'GGA', 'GAT', 'ATC', 'TCA', 'CAC', 'ACT', 'CTC', 'TCC', 'CCC', 'CCC', 'CCT', 'CTG', 'TGT', 'GTG', 'TGA', 'GAG', 'AGG', 'GGA', 'GAA', 'AAC', 'ACT', 'CTA', 'TAC', 'ACT', 'CTG', 'TGT', 'GTC', 'TCT', 'CTT', 'TTC', 'TCA', 'CAC', 'ACG', 'CGC', 'GCA', 'CAG', 'AGA']
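To connect this sliding-window list back to reading frames, here is a small sketch that looks up where ATG appears in the list and which frame that position falls in. The variable names are made up for this example; the frame is simply the 0-based index modulo 3.

```python
dna = "ACCTGCCTCTTACGAGGCGACACTCCACCATGGATCACTCCCCTGTGAGGAACTACTGTCTTCACGCAGA"
sequence_length = 3

# Same sliding windows as above: one 3-letter window per start position.
windows = [dna[i:i+sequence_length] for i in range(len(dna) - sequence_length + 1)]

# 0-based positions where a window equals the start codon.
start_positions = [i for i, window in enumerate(windows) if window == "ATG"]

for i in start_positions:
    # i % 3 is the reading-frame offset (0, 1 or 2) of this start codon.
    print(f"ATG at 0-based index {i} (1-based position {i + 1}), frame offset {i % 3}")
```

Once the offset is known, the codons of that frame can be read with a step of 3 starting from the ATG position.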
Upvotes: 0