Reputation: 69
I have to create a software that pick multi fasta files and create another with all the sequences. For that I have done the following code:
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO
# Use: python join_fasta.py infile1.fasta infile2.fasta outfile.fasta
infile1 = sys.argv[1] #Name of the input file
seq1 = list(SeqIO.parse(infile1,"fasta")) #Create a list with all the sequence records
print "Input the first fasta file = ", infile1
infile2 = sys.argv[2] #Name of the input file
seq2 = list(SeqIO.parse(infile2,"fasta")) #Create a list with all the sequence records
print "Input the second fasta file = ", infile2
totseq1 = len(seq1) #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq1
totseq2 = len(seq2) #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq2
outfile = sys.argv[3] #Name of the output file
print "Output fasta file = ", outfile
totseq3 = len(seq1) + len(seq2) #Total number of sequences in the output file
print "Number of sequences in the concatenated file = ", totseq3
output = [infile1, infile2]
SeqIO.write(output, outfile, "fasta")
exit()
It return a error:
Input the first fasta file = ITS.fasta
Input the second fasta file = banco_de_dados_ITS.fasta
Number of sequences in the first file = 114
Number of sequences in the first file = 20
Output fasta file = ITS_combined.fasta
Number of sequences in the concatenated file = 134
Traceback (most recent call last):
File "/Users/phdias/Documents/Softwares_PH/join_fasta.py", line 35, in <module>
SeqIO.write(output, outfile, "fasta")
File "/usr/local/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 535, in write
fp.write(format_function(record))
File "/usr/local/lib/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 345, in as_fasta
id = _clean(record.id)
AttributeError: 'str' object has no attribute 'id'
I try a lot to resolve this problem without success
The code was rewritten:
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO
# Use: python join_fasta.py infile1.fasta infile2.fasta outfile.fasta
infile1 = sys.argv[1] #Name of the input file
seq1 = list(SeqIO.parse(infile1,"fasta")) #Create a list with all the sequence records
print "Input the first fasta file = ", infile1
infile2 = sys.argv[2] #Name of the input file
seq2 = list(SeqIO.parse(infile2,"fasta")) #Create a list with all the sequence records
print "Input the second fasta file = ", infile2
totseq1 = len(seq1) #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq1
totseq2 = len(seq2) #Total number of sequences in the input file
print "Number of sequences in the second file = ", totseq2
outfile = sys.argv[3] #Name of the output file
print "Output fasta file = ", outfile
totseq3 = len(seq1) + len(seq2) #Total number of sequences in the output file
print "Number of sequences in the concatenated file = ", totseq3
from itertools import chain
output = list(chain(seq1, seq2))
SeqIO.write(output, outfile, "fasta")
exit()
And now it not find the second file, the output file just contains the sequences from the first one:
Input the first fasta file = ITS.fasta
Input the second fasta file = banco_de_dados_ITS.fasta
Number of sequences in the first file = 114
Number of sequences in the second file = 0
Output fasta file = its_c.fasta
Number of sequences in the concatenated file = 114
Upvotes: 1
Views: 1341
Reputation: 489
While BioPython is a great tool, things will be easier if you used pandas.
import pandas as pd
def fasta_reader(file):
fasta_df = pd.read_csv(file, sep='>', lineterminator='>', header=None)
fasta_df[['Accession', 'Sequence']] = fasta_df[0].str.split('\n', 1, \
expand=True)
fasta_df.drop(0, axis=1, inplace=True)
fasta_df['Sequence'] = fasta_df['Sequence'].replace('\n', '', regex=True)
return fasta_df
If your fasta files are named as test1.fa
, test2.fa
, you can then use pd.concat
to join.
df = pd.concat(fasta_reader(i) for i in ['test1.fa', 'test2.fa'])
Then you can also export the concatenated file as a fasta. But first you need to add >
in front of accessions.
# Exporting to fa
# adding '>' for accessions
df['Accession'] = '>' + df['Accession']
df.to_csv('combined.fa', sep='\n', index=None, header=None)
Upvotes: 1
Reputation: 490
I think the problem is here:
output = [infile1, infile2]
You are using the file paths which are of type str
instead of using seq1
and seq2
, and also you need to concatenate them to a single sequence, like that:
from itertools import chain
output = chain(seq1, seq2)
This way output
is a generator of SeqRecord
items from both seq1
and seq2
Upvotes: 2