macrosage
macrosage

Reputation: 41

How to use Biopython to translate a series of DNA sequences in a FASTA file and extract the Protein sequences into a separate field?

I am new to Biopython (and coding in general) and am trying to code a way to translate a series of DNA sequences (more than 80) into protein sequences, in a separate FASTA file. I want to also find the sequence in the correct reading frame.

Here's what I have so far:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

if len(protein1) > len(protein2) and len(protein1) > len(protein3):
    protein = protein1
elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
    protein = protein2
else:
    protein = protein3

def prot_record(record):
    return SeqRecord(seq = protein, \
             id = ">" + protein_id, \
             description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")

The problem with my current code is that while it seems to work, it only give the last sequence of the input file. Can anyone help me figure out how to write all of the sequences?

Thank you!

Upvotes: 4

Views: 9325

Answers (2)

merv
merv

Reputation: 76740

As mentioned by others, your code is iterating through the entire input before attempting to write the result. I wanted to suggest how one might do this with a streaming approach:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

with open("AAseq.fasta", 'w') as aa_fa:
    for dna_record in SeqIO.parse("dnaseq.fasta", 'fasta'):
        # use both fwd and rev sequences
        dna_seqs = [dna_record.seq, dna_record.seq.reverse_complement()]

        # generate all translation frames
        aa_seqs = (s[i:].translate(to_stop=True) for i in range(3) for s in dna_seqs)

        # select the longest one
        max_aa = max(aa_seqs, key=len)

        # write new record
        aa_record = SeqRecord(max_aa, id=dna_record.id, description="translated sequence")
        SeqIO.write(aa_record, aa_fa, 'fasta')

The main improvements here are:

  1. Individual records are translated and outputted in each iteration, minimizing memory usage.
  2. Adds support for reverse complements.
  3. Translated frames are created through a generator comprehension, and only the longest length one is stored.
  4. Avoids if...elif...else structures by instead using max with a key.

Upvotes: 9

bli
bli

Reputation: 8184

Your if is outside the for loop, so it only applies once, using the variables with the values they had at the end of the last iteration of the loop. If you want the if to happen every iteration, you need to indent it at the same level as the code before:

for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # Same indentation level, still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3

Your function prot_record uses the current value of protein and protein_id, which are again what they were at the end of the last iteration of the for loop.

If I'm guessing correctly what you want, one possibility might be to put this function declaration inside the loop too, in order for the function to have one specific behaviour depending on the current iteration of the loop, and save the function in a list for later use, when iterating again over the records. I'm not certain this works, though:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# List of functions:
record_makers = []
for record in SeqIO.parse("dnaseq.fasta", "fasta"):
    protein_id = record.id
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)
    # still in the loop
    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
    # still in the loop
    def prot_record(record):
        return SeqRecord(seq = protein, \
                 id = ">" + protein_id, \
                 description = "translated sequence")
    record_makers.append(prot_record)

# zip the functions and the records together instead of
# mapping one single function to all the records
records = [record_maker(record) for (
    record_maker, record) in zip(
        record_makers, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]

Another possible approach is to put the translation logic inside the record-making function:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def find_translation(record):
    protein1 = record.seq.translate(to_stop=True)
    protein2 = record.seq[1:].translate(to_stop=True)
    protein3 = record.seq[2:].translate(to_stop=True)

    if len(protein1) > len(protein2) and len(protein1) > len(protein3):
        protein = protein1
    elif len(protein2) > len(protein1) and len(protein2) > len(protein3):
        protein = protein2
    else:
        protein = protein3
    return protein

def prot_record(record):
    protein = find_translation(record)
    # By the way: no need for backslashes here
    return SeqRecord(seq = protein,
                     id = ">" + record.id,
                     description = "translated sequence")

records = map(prot_record, SeqIO.parse("dnaseq.fasta", "fasta"))
SeqIO.write(records, "AAseq.fasta", "fasta")]

This is possibly cleaner. I haven't tested.

Upvotes: 2

Related Questions