Reputation: 85
I have two FASTA file with DNA sequences coding for two different proteins. I want to join the sequences for the different proteins and same species into one long sequence.
for example, I have:
Protein 1
>sce
AGTAGATGACAGCT
>act
GCTAGCTAGCT
Protein 2
>sce
GCTACGATCGACT
>act
TACGATCAGCTA
Protein 1+2
>sce
AGTAGATGACAGCTGCTACGATCGACT
>act
GCTAGCTAGCTTACGATCAGCTA
Something that might be a bit of an issue is that the species don't appear in the same order in both files and there's a few sequences that are found in one, but not in the other (files are about 110-species long, with discrepancy of 4 or 5).
My first attempt at writing a code for it was:
gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
outfile = open("joined.fas", 'w')
for line in gamma:
if line.startswith(">"):
for line2 in spc:
if line2.startswith(">"):
if line == line2:
outfile.write(line)
else:
outfile.write(line)
fh.close()
but since the DNA sequences are very long and take many lines of the file, I don't know how to select them.
Please help!
Upvotes: 0
Views: 441
Reputation: 41168
Since you tagged Biopython, here is a compact solution. Note it puts the whole file into memory (as most simple approaches will):
from Bio.Seq import Seq
from Bio import SeqIO
d = SeqIO.to_dict(SeqIO.parse('1.fasta', 'fasta'))
for r in SeqIO.parse('2.fasta', 'fasta'):
d[r.id] = d.setdefault(r.id, Seq('')) + r.seq
SeqIO.write(d.values(), 'output.fasta', 'fasta')
Here 1.fasta
and 2.fasta
are your two input fasta files, and output.fasta
is your merged output file.
Also, note that biologically I think this is an odd thing to do, concatenating sequences across multiple files could lead to the creation of 'fake' contiguous sequences, and the order of concatenation is surely important, so be careful
Upvotes: 1
Reputation: 6573
By using a dictionary, you could append fasta sequences to each ID. And then, print them to the output file.
outfile = open("joined.fas", 'w')
d = dict()
for file in ('gamma.fas', 'spc98.fas'):
with open(file, 'r') as f:
for line in f:
line = line.rstrip()
if line.startswith('>'):
key = line
else:
d.setdefault(key, '')
d[key] += line
for key, seq in d.items():
outfile.write(key + "\n" + seq + "\n")
outfile.close()
EDIT: By the way, you are opening your two reading files as open for writing which will clobber the two input files.
gamma = open('gamma.fas', 'w')
spc = open("spc98.fas", 'w')
They should be opened with r
instead of w
.
Upvotes: 0