Join distinct FASTA files using python and Biopython

Question

I have to create a software that pick multi fasta files and create another with all the sequences. For that I have done the following code:

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO

# Use: python join_fasta.py infile1.fasta infile2.fasta outfile.fasta


infile1 = sys.argv[1]                                #Name of the input file
seq1 = list(SeqIO.parse(infile1,"fasta"))             #Create a list with all the sequence records
print "Input the first fasta file = ", infile1

infile2 = sys.argv[2]                                #Name of the input file
seq2 = list(SeqIO.parse(infile2,"fasta"))             #Create a list with all the sequence records
print "Input the second fasta file = ", infile2

totseq1 = len(seq1)                                   #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq1

totseq2 = len(seq2)                                   #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq2

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

totseq3 = len(seq1) + len(seq2)                                   #Total number of sequences in the output file
print "Number of sequences in the concatenated file = ", totseq3

output = [infile1, infile2]




SeqIO.write(output, outfile, "fasta") 

exit()

It return a error:

Input the first fasta file =  ITS.fasta
Input the second fasta file =  banco_de_dados_ITS.fasta
Number of sequences in the first file =  114
Number of sequences in the first file =  20
Output fasta file =  ITS_combined.fasta
Number of sequences in the concatenated file =  134
Traceback (most recent call last):
  File "/Users/phdias/Documents/Softwares_PH/join_fasta.py", line 35, in 
    SeqIO.write(output, outfile, "fasta") 
  File "/usr/local/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 535, in write
    fp.write(format_function(record))
  File "/usr/local/lib/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 345, in as_fasta
    id = _clean(record.id)
AttributeError: 'str' object has no attribute 'id'

I try a lot to resolve this problem without success

The code was rewritten:

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO

# Use: python join_fasta.py infile1.fasta infile2.fasta outfile.fasta


infile1 = sys.argv[1]                                #Name of the input file
seq1 = list(SeqIO.parse(infile1,"fasta"))             #Create a list with all the sequence records
print "Input the first fasta file = ", infile1

infile2 = sys.argv[2]                                #Name of the input file
seq2 = list(SeqIO.parse(infile2,"fasta"))             #Create a list with all the sequence records
print "Input the second fasta file = ", infile2

totseq1 = len(seq1)                                   #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq1

totseq2 = len(seq2)                                   #Total number of sequences in the input file
print "Number of sequences in the second file = ", totseq2

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

totseq3 = len(seq1) + len(seq2)                                   #Total number of sequences in the output file
print "Number of sequences in the concatenated file = ", totseq3

from itertools import chain
output = list(chain(seq1, seq2))


SeqIO.write(output, outfile, "fasta") 

exit()

And now it not find the second file, the output file just contains the sequences from the first one:

Input the first fasta file =  ITS.fasta
Input the second fasta file =  banco_de_dados_ITS.fasta
Number of sequences in the first file =  114
Number of sequences in the second file =  0
Output fasta file =  its_c.fasta
Number of sequences in the concatenated file =  114

ronpi · Accepted Answer

I think the problem is here:

output = [infile1, infile2]

You are using the file paths which are of type str instead of using seq1 and seq2, and also you need to concatenate them to a single sequence, like that:

from itertools import chain
output = chain(seq1, seq2)

This way output is a generator of SeqRecord items from both seq1 and seq2

Join distinct FASTA files using python and Biopython

Answers (2)

Related Questions