Reputation: 41
I am new to Biopython (and coding in general) and am trying to code a way to translate a series of DNA sequences (more than 80) into protein sequences, in a separate FASTA file. I want to also find the sequence in the correct reading frame.
Here's what I have so far:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
def prot_record(record):
return SeqRecord(seq = protein, \
id = ">" + protein_id, \
description = "translated sequence")
records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")
The problem with my current code is that while it seems to work, it only give the last sequence of the input file. Can anyone help me figure out how to write all of the sequences?
Thank you!
Upvotes: 4
Views: 9325
Reputation: 76740
As mentioned by others, your code is iterating through the entire input before attempting to write the result. I wanted to suggest how one might do this with a streaming approach:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
with open("AAseq.fasta", 'w') as aa_fa:
for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
# use both fwd and rev sequences
dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]
# generate all translation frames
aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)
# select the longest one
max_aa = max(aa_seqs, key=len)
# write new record
aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
SeqIO.write(aa_record, aa_fa, 'fasta')
The main improvements here are:
if...elif...else
structures by instead using max
with a key.Upvotes: 9
Reputation: 8184
Your if
is outside the for
loop, so it only applies once, using the variables with the values they had at the end of the last iteration of the loop. If you want the if
to happen every iteration, you need to indent it at the same level as the code before:
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
# Same indentation level, still in the loop
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
Your function prot_record
uses the current value of protein
and protein_id
, which are again what they were at the end of the last iteration of the for
loop.
If I'm guessing correctly what you want, one possibility might be to put this function declaration inside the loop too, in order for the function to have one specific behaviour depending on the current iteration of the loop, and save the function in a list for later use, when iterating again over the records. I'm not certain this works, though:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
# List of functions:
record_makers = []
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
protein_id = record.id
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
# still in the loop
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
# still in the loop
def prot_record(record):
return SeqRecord(seq = protein, \
id = ">" + protein_id, \
description = "translated sequence")
record_makers.append(prot_record)
# zip the functions and the records together instead of
# mapping one single function to all the records
records = [record_maker(record) for (
record_maker, record) in zip(
record_makers, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]
Another possible approach is to put the translation logic inside the record-making function:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
def find_translation(record):
protein1 = record.seq.translate(to_stop=True)
protein2 = record.seq[1:].translate(to_stop=True)
protein3 = record.seq[2:].translate(to_stop=True)
if len(protein1) > len(protein2) and len(protein1) > len(protein3):
protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
protein = protein2
else:
protein = protein3
return protein
def prot_record(record):
protein = find_translation(record)
# By the way: no need for backslashes here
return SeqRecord(seq = protein,
id = ">" + record.id,
description = "translated sequence")
records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]
This is possibly cleaner. I haven't tested.
Upvotes: 2