Lum Ramabaja
Lum Ramabaja

Reputation: 326

Trying to find efficient ways to remove headers in fasta files

I wrote an ugly code which removes the fasta header and creates a variable with the protein sequence as a string. How could I do this more efficient? Is there a good way how to do this in biopython?

f = open('protein1.fasta', 'r')
raw_samples = f.readlines()
f.close()

samples = ''

for elem in raw_samples:
    if elem[0] == '>':
        raw_samples = elem[1:].rstrip()
    else:
        samples += elem.rstrip()

print samples 

Upvotes: 0

Views: 3564

Answers (2)

Colin Anthony
Colin Anthony

Reputation: 1217

An example using biopython and a dictionary to store the variable, in case accessing via a dictionary is advantageous in your situation:

dct = {}
for seq_record in SeqIO.parse(open(infile.fasta), "fasta"):
    try:
        dct['samples'].append(str(seq_record.seq).upper())
    except:
        dct['samples'] = str(seq_record.seq).upper()

Many tools introduce line-wrapping in fasta files, so for robustness, I would use biopython to import the file. @wflynny biopython + list comprehension solution is probably more efficient, but I would use the dictionary method if you want to have multiple variables, each linked to a sequence(s)

Upvotes: 0

wflynny
wflynny

Reputation: 18521

You want to do something like

sequences = []
with open('protein1.fasta', 'r') as fin:
    sequence = ''
    for line in fin:
        if line.startswith('>'):
            sequences.append(sequence)
            sequence = ''
        else:
            sequence += line.strip()

With biopython, you could do

from Bio import AlignIO
alignment = AlignIO.read(open('protein1.fasta'), 'fasta')
sequences = [record.seq for record in alignment]

Edit: Actually what I've been doing most often, when my sequences have no linebreaks in them, is something like:

from itertools import izip_longest
sequences = []
with open('protein1.fasta', 'r') as fin:
    for header, seq in izip_longest(*[fin]*2):
        sequences.append(seq)

The important thing here is the zip(*[fin]*2) which zips the file iterator fin with itself ([fin]*2 == [fin, fin]). Due to a.) the way the file iterators work and b.) that we're zipping it with itself, you can think of the zip operation as

yield (fin.next(), fin.next())

which yield two lines at a time, which fits nicely with fasta files where sequences don't have line breaks.

Upvotes: 2

Related Questions