user1426421

Reputation: 81

Concatenating Multiple .fasta Files

I'm trying to concatenate hundreds of .fasta files into a single, large fasta file containing all of the sequences. I haven't found a specific method to accomplish this in the forums. I did come across this code from http://zientzilaria.heroku.com/blog/2007/10/29/merging-single-or-multiple-sequence-fasta-files, which I have adapted a bit.

Fasta.py contains the following code:

class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
            if index >= 1:
                items.append(aninstance)
            index += 1
            name = line[:-1]
            seq = ''
            aninstance = fasta(name, seq)
        else:
            seq += line[:-1]
            aninstance = fasta(name, seq)

    items.append(aninstance)
    return items

And here is the adapted script to concatenate .fasta files:

import sys
import glob
import fasta

#obtain directory containing single fasta files for query
filepattern = input('Filename pattern to match: ')

#obtain output directory
outfile = input('Filename of output file: ')

#create new output file
output = open(outfile, 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):

    print("file: " + file)

    #we read the contents and a list of fasta instances is returned
    contents = fasta.read_fasta(open(file).readlines())

    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()
print("done")

The script is able to read the .fasta files, but the newly created output file contains no sequences. The error I receive comes from fasta.py, which I don't know how to fix:

Traceback (most recent call last):
  File "C:\Python32\myfiles\test\3\Fasta_Concatenate.py", line 28, in <module>
    contents = fasta.read_fasta(open(file).readlines())
  File "C:\Python32\lib\fasta.py", line 18, in read_fasta
    seq += line[:-1]
UnboundLocalError: local variable 'seq' referenced before assignment

Any suggestions? Thanks!

Upvotes: 2

Views: 13648

Answers (5)

kvantour

Reputation: 26481

The following ensures that new files always start on a new line:

$ awk 1 *.fasta > largefile.fasta

A solution using cat can run two files together when one of them does not end with a newline:

$ echo -n foo > f1
$ echo bar > f2
$ cat f1 f2
foobar
$ awk 1 f1 f2
foo
bar
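
For readers who would rather stay in Python, here is a minimal sketch of the same idea: copy each file to the output and add a newline only when the file does not already end with one. The function name concat_with_newlines and its parameters are made up for illustration, and each file is read fully into memory, which assumes the individual files are reasonably small.

```python
def concat_with_newlines(paths, out_path):
    """Concatenate files, making sure each non-empty file ends with a newline."""
    # Binary mode copies the bytes unchanged, whatever the line endings are.
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as handle:
                data = handle.read()
            out.write(data)
            # Add the missing newline so the next file starts on its own line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
```

It could be called as concat_with_newlines(sorted(glob.glob("*.fasta")), "largefile.txt") after import glob; giving the output a name that the pattern does not match keeps it from being picked up on a later run.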

Upvotes: 2

BioDeveloper

Reputation: 618

On Windows, from the command prompt (note: the folder should contain only the files you want to merge):

copy *.fasta final.fasta

Enjoy.

Upvotes: 1

Valentin Ruano

Reputation: 2809

I'm not a Python programmer, but it seems that the question's code tries to join the data for each sequence onto a single line and to separate the sequences with a blank line.

  >seq1
  00000000
  11111111
  >seq2
  22222222
  33333333

would become

  >seq1
  0000000011111111

  >seq2
  2222222233333333

If this is in fact needed, the cat-based solution would not work. Otherwise, cat is the simplest and most effective solution.
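
If the one-line-per-sequence layout is what is wanted, something along these lines might do it. This is only a sketch: iter_records and merge_fasta are hypothetical names, blank lines and anything before the first ">" header are assumed to be safe to ignore, and the output format (header, joined sequence, blank line) simply copies what the question's script writes.

```python
def iter_records(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
        # text before the first header is ignored (assumption)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(paths, out_path):
    """Write every record found in paths to out_path, one sequence per line."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for header, sequence in iter_records(handle):
                    out.write(header + "\n" + sequence + "\n\n")
```

A call such as merge_fasta(sorted(glob.glob("*.fasta")), "merged.txt"), after import glob, would process the files one record at a time instead of holding every name and sequence in lists first.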

Upvotes: 1

Steve

Reputation: 54392

I think using Python for this job is overkill. On the command line, a quick way to concatenate one or more fasta files with the .fasta or .fa extension is simply:

cat *.fa* > newfile.txt

Upvotes: 8

aayoubi

Reputation: 12069

The problem is in fasta.py:

else:
       seq += line[:-1]
       aninstance = fasta(name, seq)

Try initializing seq at the start of read_fasta(file).

EDIT: Further explanation

When you call read_fasta, the first line in the file does not start with >, so the else branch runs and tries to append that line to seq, which has not been assigned yet. In Python a local name has no value at all until its first assignment (it is unbound rather than null), so the += cannot work. The error in the traceback says exactly this:

UnboundLocalError: local variable 'seq' referenced before assignment
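
One possible way to write that fix, as a sketch rather than a drop-in replacement: name and the list of sequence lines are set up before the loop, and lines that appear before the first > header (a blank line, for instance) are skipped, which is an assumption about what you want. The class is renamed Fasta here only to follow the usual naming convention; the rest of the interface matches the original.

```python
class Fasta:
    """One FASTA record: the header line (with '>') and the joined sequence."""
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def read_fasta(lines):
    """Parse an iterable of lines into a list of Fasta records."""
    items = []
    name = None   # no header seen yet
    chunks = []   # sequence lines of the current record
    for line in lines:
        # rstrip removes only the line ending, so a last line without a
        # trailing newline does not lose its final character (line[:-1] would).
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                items.append(Fasta(name, "".join(chunks)))
            name = line
            chunks = []
        elif name is not None and line:
            chunks.append(line)
        # anything before the first header is skipped (assumption)
    if name is not None:
        items.append(Fasta(name, "".join(chunks)))
    return items
```

With this version a file that does not begin with a header no longer raises the error, and an empty file returns an empty list instead of failing on the final append.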

Upvotes: 1
