Reputation: 81
I'm trying to concatenate hundreds of .fasta files into a single, large fasta file containing all of the sequences. I haven't found a specific method to accomplish this in the forums. I did come across this code from http://zientzilaria.heroku.com/blog/2007/10/29/merging-single-or-multiple-sequence-fasta-files, which I have adapted a bit.
Fasta.py contains the following code:
class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index+=1
            name = line[:-1]
            seq = ''
            aninstance = fasta(name, seq)
        else:
            seq += line[:-1]
            aninstance = fasta(name, seq)
    items.append(aninstance)
    return items
And here is the adapted script to concatenate .fasta files:
import sys
import glob
import fasta

#obtain directory containing single fasta files for query
filepattern = input('Filename pattern to match: ')

#obtain output directory
outfile = input('Filename of output file: ')

#create new output file
output = open(outfile, 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):
    print ("file: " + file)
    #we read the contents and an instance of the class is returned
    contents = fasta.read_fasta(open(file).readlines())
    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()
print("done")
The script is able to read the fasta files, but the newly created output file contains no sequences. The error I receive comes from fasta.py, which is beyond my ability to fix:
Traceback (most recent call last):
  File "C:\Python32\myfiles\test\3\Fasta_Concatenate.py", line 28, in <module>
    contents = fasta.read_fasta(open(file).readlines())
  File "C:\Python32\lib\fasta.py", line 18, in read_fasta
    seq += line[:-1]
UnboundLocalError: local variable 'seq' referenced before assignment
Any suggestions? Thanks!
Upvotes: 2
Views: 13648
Reputation: 26481
The following ensures that new files always start on a new line:
$ awk 1 *.fasta > largefile.fasta
The solution using cat might fail on that:
$ echo -n foo > f1
$ echo bar > f2
$ cat f1 f2
foobar
$ awk 1 f1 f2
foo
bar
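For anyone who wants the same guarantee without awk, here is a minimal Python sketch of the idea. The function name concat_files and its parameters are made up for this illustration; it simply copies each file and adds a newline when a file does not already end with one.

```python
def concat_files(paths, out_path):
    """Concatenate files, adding a newline after any file that lacks one."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            # Like `awk 1`: make sure the next file starts on a new line.
            # Empty files contribute nothing.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
```

It could be called with something like concat_files(sorted(glob.glob("*.fasta")), "largefile.fasta") after import glob. Each file is read fully into memory, which should be fine for typical fasta files but is a simplification.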
Upvotes: 2
Reputation: 618
For Windows, from the command prompt (note: the folder should contain only the required files):
copy *.fasta final.fasta
Enjoy.
Upvotes: 1
Reputation: 2809
I'm not a Python programmer, but it seems that the code in the question tries to condense the data for each sequence into a single line and also to separate the sequences with a blank line.
>seq1
00000000
11111111
>seq2
22222222
33333333
would become
>seq1
0000000011111111
>seq2
2222222233333333
If this is in fact needed, the cat-based solution would not work. Otherwise, cat is the simplest and most effective solution.
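If the one-line-per-sequence form is wanted, the transformation can be sketched in Python roughly as follows. The name collapse_records is hypothetical, and skipping blank lines and any text before the first header is an assumption of this sketch, not something the question's code does.

```python
def collapse_records(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line)
        # Lines before the first header are ignored (assumption of this sketch).
    if header is not None:
        yield header, "".join(chunks)
```

Applied to the example above, it should yield (">seq1", "0000000011111111") and (">seq2", "2222222233333333"), which can then be written out with whatever separator is wanted.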
Upvotes: 1
Reputation: 54392
I think using python for this job is overkill. On the command line, a quick way to concatenate single/multiple fasta files with the .fasta or .fa extensions is to simply:
cat *.fa* > newfile.txt
Upvotes: 8
Reputation: 12069
The problem is in fasta.py:
    else:
        seq += line[:-1]
        aninstance = fasta(name, seq)
Try initializing seq at the start of read_fasta(file).
EDIT: Further explanation
When read_fasta is first called, the first line of the file does not start with >, so the code tries to append that line to seq, which has not been assigned yet. There is no value to append to, not even an empty string, and the error in the traceback says exactly that:
UnboundLocalError: local variable 'seq' referenced before assignment
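A minimal sketch of that fix, keeping the class and function names from the question. Beyond initializing seq, it makes two assumptions that are not in the original code: text appearing before the first > header is skipped, and line endings are stripped with rstrip instead of line[:-1], so a last line without a trailing newline does not lose a character.

```python
class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    name = None  # no header seen yet
    seq = ''     # initialized up front, so `seq +=` never hits an unbound name
    for line in file:
        line = line.rstrip('\r\n')
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                items.append(fasta(name, seq))
            name = line
            seq = ''
        elif name is not None:
            seq += line
        # Text before the first header is skipped (assumption of this sketch).
    if name is not None:
        items.append(fasta(name, seq))
    return items
```

With this version an empty file returns an empty list instead of failing on items.append(aninstance), and the concatenation script in the question should be able to call it unchanged.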
Upvotes: 1