Ana Catarina Vitorino
Ana Catarina Vitorino

Reputation: 85

Joining DNA sequences from two files under the same species name

I have two FASTA file with DNA sequences coding for two different proteins. I want to join the sequences for the different proteins and same species into one long sequence.

for example, I have:

Protein 1
>sce
AGTAGATGACAGCT
>act
GCTAGCTAGCT
Protein 2
>sce
GCTACGATCGACT
>act
TACGATCAGCTA
Protein 1+2
>sce
AGTAGATGACAGCTGCTACGATCGACT
>act
GCTAGCTAGCTTACGATCAGCTA

Something that might be a bit of an issue is that the species don't appear in the same order in both files and there's a few sequences that are found in one, but not in the other (files are about 110-species long, with discrepancy of 4 or 5).

My first attempt at writing a code for it was:

gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
outfile = open("joined.fas", 'w')
for line in gamma:
    if line.startswith(">"):
        for line2 in spc:
             if line2.startswith(">"):
                if line == line2:
                    outfile.write(line)
    else:
        outfile.write(line)
fh.close()

but since the DNA sequences are very long and take many lines of the file, I don't know how to select them.

Please help!

Upvotes: 0

Views: 441

Answers (2)

Chris_Rands
Chris_Rands

Reputation: 41168

Since you tagged Biopython, here is a compact solution. Note it puts the whole file into memory (as most simple approaches will):

from Bio.Seq import Seq
from Bio import SeqIO

d = SeqIO.to_dict(SeqIO.parse('1.fasta', 'fasta'))

for r in SeqIO.parse('2.fasta', 'fasta'):
    d[r.id] = d.setdefault(r.id, Seq('')) + r.seq

SeqIO.write(d.values(), 'output.fasta', 'fasta')

Here 1.fasta and 2.fasta are your two input fasta files, and output.fasta is your merged output file.

Also, note that biologically I think this is an odd thing to do, concatenating sequences across multiple files could lead to the creation of 'fake' contiguous sequences, and the order of concatenation is surely important, so be careful

Upvotes: 1

Chris Charley
Chris Charley

Reputation: 6573

By using a dictionary, you could append fasta sequences to each ID. And then, print them to the output file.

outfile = open("joined.fas", 'w')

d = dict()

for file in ('gamma.fas', 'spc98.fas'):
    with open(file, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                key = line
            else:
                d.setdefault(key, '')
                d[key] += line

for key, seq in d.items():
    outfile.write(key + "\n" + seq + "\n")

outfile.close()

EDIT: By the way, you are opening your two reading files as open for writing which will clobber the two input files.

gamma = open('gamma.fas', 'w') spc = open("spc98.fas", 'w')

They should be opened with r instead of w.

Upvotes: 0

Related Questions