Reputation: 103
I know that this question has been asked before but I'm getting some really weird output from this. Basically I'm trying to convert a DNA sequence in .fasta format (i.e. an identifier beginning with a ">" followed by the sequence on the next line) to amino acid letters in the same format. I have the code:
#!/usr/bin/python
import sys
filename = sys.argv[1]
def translate_dna(sequence):
    codontable = {
        'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
        'ATG':'M'
    }
    proteinsequence = ''
    start = sequence.find('ATG')
    sequencestart = sequence[int(start):]
    stop = sequencestart.find('TAA')
    cds = str(sequencestart[:int(stop)+3])
    for n in range (0,len(cds),3):
        if cds[n:n+3] in codontable:
            proteinsequence += codontable[cds[n:n+3]]
        print proteinsequence
    sequence = ''
header = ''
sequence = ''
for line in open(filename):
    if line[0] == ">":
        if header != '':
            print header
            translate_dna(sequence)
        header = line.strip()
        sequence = ''
    else:
        sequence += line.strip()
print header
translate_dna(sequence)
My output is expected to come out like this:
mouse_IPS1_cds
MFAEDKTY (and so on and so on)
But I actually get this where it prints a new letter each line and does not complete to the end of the sequence:
mouse_IPS1_cds
M
MF
MFA
MFAE
MFAED
MFAEDK
MFAEDKT
MFAEDKTY (it stops here when it should be longer)
The output thus forms a kind of incomplete half triangle of letters that grows by one letter on each line.
Please, is there any way someone could point out what's making this happen? Why would it be printing a new letter each line and then not even finish the sequence?
Any help is greatly appreciated.
Upvotes: 1
Views: 252
Reputation: 6419
You're printing proteinsequence on every iteration of the loop in which you're building it. As a result, you get each intermediate version. Move the print statement out of the loop, to just after it, like this, and you'll only print out the final product:
    for n in range (0,len(cds),3):
        if cds[n:n+3] in codontable:
            proteinsequence += codontable[cds[n:n+3]]
    sequence = ''
    print proteinsequence
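As for the output stopping early, one likely cause (an assumption, since the input file isn't shown) is that sequence.find('TAA') returns the first TAA anywhere after the ATG, even when it straddles two codons, so the cds is cut short. The codon table as posted also appears to lack ACA, ATA, ATC and ATT, and the "in codontable" check silently skips them. Below is a minimal sketch of an alternative that reads codon by codon in frame and stops at the first in-frame stop codon. CODON_TABLE is my own compact construction of the standard genetic code, not the table from the question, and the "X" placeholder for unrecognised codons is likewise just an illustrative choice. The function returns the protein instead of printing it, and print(...) with a single argument works on both Python 2 and 3.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(sequence):
    """Translate from the first ATG up to the first in-frame stop codon."""
    sequence = sequence.upper()
    start = sequence.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    # Step through the sequence three bases at a time, staying in frame.
    for n in range(start, len(sequence) - 2, 3):
        # "X" marks a codon that is not in the table (illustrative choice).
        amino_acid = CODON_TABLE.get(sequence[n:n + 3], "X")
        if amino_acid == "*":
            break  # in-frame stop codon: translation ends here
        protein.append(amino_acid)
    return "".join(protein)


# GATAAA contains an out-of-frame TAA, which find('TAA') would stop at.
print(translate_dna("ATGTTTGCTGAAGATAAAACCTATTGA"))  # prints MFAEDKTY
```

In the main loop you would then write print(translate_dna(sequence)) after printing each header. If you want the trailing * in the output, append it before the break.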
Upvotes: 1