Pedro Henrique
Pedro Henrique

Reputation: 69

Join distinct FASTA files using python and Biopython

I have to create a software that pick multi fasta files and create another with all the sequences. For that I have done the following code:

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO

# Use: python join_fasta.py infile1.fasta infile2.fasta outfile.fasta


infile1 = sys.argv[1]                                #Name of the input file
seq1 = list(SeqIO.parse(infile1,"fasta"))             #Create a list with all the sequence records
print "Input the first fasta file = ", infile1

infile2 = sys.argv[2]                                #Name of the input file
seq2 = list(SeqIO.parse(infile2,"fasta"))             #Create a list with all the sequence records
print "Input the second fasta file = ", infile2

totseq1 = len(seq1)                                   #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq1

totseq2 = len(seq2)                                   #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq2

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

totseq3 = len(seq1) + len(seq2)                                   #Total number of sequences in the output file
print "Number of sequences in the concatenated file = ", totseq3

output = [infile1, infile2]




SeqIO.write(output, outfile, "fasta") 

exit()

It return a error:

Input the first fasta file =  ITS.fasta
Input the second fasta file =  banco_de_dados_ITS.fasta
Number of sequences in the first file =  114
Number of sequences in the first file =  20
Output fasta file =  ITS_combined.fasta
Number of sequences in the concatenated file =  134
Traceback (most recent call last):
  File "/Users/phdias/Documents/Softwares_PH/join_fasta.py", line 35, in <module>
    SeqIO.write(output, outfile, "fasta") 
  File "/usr/local/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 535, in write
    fp.write(format_function(record))
  File "/usr/local/lib/python2.7/site-packages/Bio/SeqIO/FastaIO.py", line 345, in as_fasta
    id = _clean(record.id)
AttributeError: 'str' object has no attribute 'id'

I try a lot to resolve this problem without success

The code was rewritten:

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import AlignIO

# Use: python join_fasta.py infile1.fasta infile2.fasta outfile.fasta


infile1 = sys.argv[1]                                #Name of the input file
seq1 = list(SeqIO.parse(infile1,"fasta"))             #Create a list with all the sequence records
print "Input the first fasta file = ", infile1

infile2 = sys.argv[2]                                #Name of the input file
seq2 = list(SeqIO.parse(infile2,"fasta"))             #Create a list with all the sequence records
print "Input the second fasta file = ", infile2

totseq1 = len(seq1)                                   #Total number of sequences in the input file
print "Number of sequences in the first file = ", totseq1

totseq2 = len(seq2)                                   #Total number of sequences in the input file
print "Number of sequences in the second file = ", totseq2

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

totseq3 = len(seq1) + len(seq2)                                   #Total number of sequences in the output file
print "Number of sequences in the concatenated file = ", totseq3

from itertools import chain
output = list(chain(seq1, seq2))


SeqIO.write(output, outfile, "fasta") 

exit()

And now it not find the second file, the output file just contains the sequences from the first one:

Input the first fasta file =  ITS.fasta
Input the second fasta file =  banco_de_dados_ITS.fasta
Number of sequences in the first file =  114
Number of sequences in the second file =  0
Output fasta file =  its_c.fasta
Number of sequences in the concatenated file =  114

Upvotes: 1

Views: 1341

Answers (2)

cmbfast
cmbfast

Reputation: 489

While BioPython is a great tool, things will be easier if you used pandas.

import pandas as pd

def fasta_reader(file):
    fasta_df = pd.read_csv(file, sep='>', lineterminator='>', header=None)
    fasta_df[['Accession', 'Sequence']] = fasta_df[0].str.split('\n', 1, \
                                        expand=True)
    fasta_df.drop(0, axis=1, inplace=True)
    fasta_df['Sequence'] = fasta_df['Sequence'].replace('\n', '', regex=True)
    return fasta_df

If your fasta files are named as test1.fa, test2.fa, you can then use pd.concat to join.

df = pd.concat(fasta_reader(i) for i in ['test1.fa', 'test2.fa'])

Then you can also export the concatenated file as a fasta. But first you need to add > in front of accessions.

# Exporting to fa
# adding '>' for accessions
df['Accession'] = '>' + df['Accession']
df.to_csv('combined.fa', sep='\n', index=None, header=None)

Upvotes: 1

ronpi
ronpi

Reputation: 490

I think the problem is here:

output = [infile1, infile2]

You are using the file paths which are of type str instead of using seq1 and seq2, and also you need to concatenate them to a single sequence, like that:

from itertools import chain
output = chain(seq1, seq2)

This way output is a generator of SeqRecord items from both seq1 and seq2

Upvotes: 2

Related Questions