Incorrect output format when translating DNA to Protein

Question

I know that this question has been asked before but I'm getting some really weird output from this. Basically I'm trying to convert a DNA sequence in .fasta format (i.e. an identifier beginning with a ">" followed by the sequence on the next line) to amino acid letters in the same format. I have the code:

#!/usr/bin/python

import sys

filename = sys.argv[1]

def translate_dna(sequence):

    codontable = {
    'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    'ATG':'M'
    }
    proteinsequence = ''
    start = sequence.find('ATG')
    sequencestart = sequence[int(start):]
    stop = sequencestart.find('TAA')
    cds = str(sequencestart[:int(stop)+3])

    for n in range (0,len(cds),3):
            if cds[n:n+3] in codontable:
                    proteinsequence += codontable[cds[n:n+3]]
                    print proteinsequence
            sequence = ''

header = ''
sequence = ''
for line in open(filename):
    if line[0] == ">":
            if header != '':
                    print header
                    translate_dna(sequence)

            header = line.strip()
            sequence = ''
    else:
            sequence += line.strip()

print header
translate_dna(sequence)

My output is expected to come out like this:

mouse_IPS1_cds
MFAEDKTY (and so on and so on)

But what I actually get is this, where it prints a new letter on each line and does not continue to the end of the sequence:

mouse_IPS1_cds
M
MF
MFA
MFAE
MFAED
MFAEDK
MFAEDKT
MFAEDKTY (it stops here when it should be longer)

The output thus forms a kind of incomplete half-triangle of letters that grows by one letter on each line.

Please, is there any way someone could point out what's making this happen? Why would it be printing a new letter each line and then not even finish the sequence?

Any help is greatly appreciated.

seaotternerd · Accepted Answer

You're printing proteinsequence on every iteration of the loop that builds it, so you see each intermediate version. Move the print statement out of the loop so that it runs once, after the loop has finished, and you'll only print the final product:

# Build the whole protein string first...
for n in range(0, len(cds), 3):
    if cds[n:n+3] in codontable:
        proteinsequence += codontable[cds[n:n+3]]
    sequence = ''
# ...then print it once, after the loop has finished.
print proteinsequence
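
The fix above addresses the repeated printing but not the truncated output. One possible cause, assuming the input contains a TAA that is not aligned with the reading frame: sequence.find('TAA') returns the first TAA at any offset, so the slice can end in the middle of a codon. Separately, the codon table in the question lacks ACA (T) and ATA, ATC, ATT (I), so those codons are silently skipped. Below is a minimal sketch of one way to handle both, written for Python 3 rather than the Python 2 of the question. The names CODON_TABLE and translate_fasta are illustrative and not from the original post, and stopping at TAG and TGA as well as TAA, and marking unknown codons with X, are assumptions about the intended behaviour.

```python
# Standard genetic code, built from the four bases in TCAG order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_dna(sequence):
    """Translate from the first ATG up to and including the first in-frame stop."""
    sequence = sequence.upper()
    start = sequence.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    # Step through whole codons only, staying in the frame set by the ATG.
    for n in range(start, len(sequence) - 2, 3):
        # Assumption: codons with ambiguous bases (e.g. N) become 'X'.
        amino_acid = CODON_TABLE.get(sequence[n:n + 3], "X")
        protein.append(amino_acid)
        if amino_acid == "*":
            break  # assumption: any of TAA, TAG, TGA ends translation
    return "".join(protein)


def translate_fasta(lines):
    """Yield (header, protein) for each record in an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, translate_dna("".join(chunks))
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, translate_dna("".join(chunks))


if __name__ == "__main__":
    # Made-up record; the TAA at offset 5 is out of frame and is ignored.
    demo = [">example", "ATGTTTAATG", "CCTAA"]
    for header, protein in translate_fasta(demo):
        print(header)
        print(protein)
```

Returning the string instead of printing inside the function keeps the output decision with the caller. To process a file as in the question, something like `for header, protein in translate_fasta(open(filename)):` should work in place of the demo list.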
