Trying to find efficient ways to remove headers in fasta files

Question

I wrote some ugly code that removes the FASTA header and stores the protein sequence in a variable as a string. How could I do this more efficiently? Is there a good way to do this in Biopython?

f = open('protein1.fasta', 'r')
raw_samples = f.readlines()
f.close()

samples = ''

for elem in raw_samples:
    if elem[0] == '>':
        raw_samples = elem[1:].rstrip()
    else:
        samples += elem.rstrip()

print(samples)

wflynny · Accepted Answer

You want to do something like

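sequences = []
with open('protein1.fasta', 'r') as fin:
    sequence = ''
    for line in fin:
        if line.startswith('>'):
            # A new header ends the previous record; skip the empty one before the first header.
            if sequence:
                sequences.append(sequence)
            sequence = ''
        else:
            sequence += line.strip()
    # Store the final record, which has no header after it.
    if sequence:
        sequences.append(sequence)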
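
The same loop can be wrapped in a reusable function. The following is only a minimal sketch in Python 3, not part of the original answer: the name read_fasta_sequences and the sample records are hypothetical. It collects the pieces of each record in a list and joins them once, which avoids repeated string concatenation on long sequences.

```python
def read_fasta_sequences(lines):
    """Return the sequences from an iterable of FASTA lines, headers dropped."""
    sequences = []
    chunks = None  # stays None until the first header is seen
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # A header closes the previous record, if there was one.
            if chunks is not None:
                sequences.append(''.join(chunks))
            chunks = []
        elif chunks is not None:
            chunks.append(line)
    # Close the last record, which has no header after it.
    if chunks is not None:
        sequences.append(''.join(chunks))
    return sequences


# Hypothetical sample data standing in for the contents of a FASTA file.
example = [">sp|P1 first\n", "MKVL\n", "AAGT\n", ">sp|P2 second\n", "GATT\n"]
print(read_fasta_sequences(example))
```

Because the function takes any iterable of lines, an open file object such as the fin used above could be passed to it directly.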

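With Biopython, you could do the following. Note that AlignIO.read expects the file to hold a single alignment, so all sequences must have the same length; for unaligned sequences, Bio.SeqIO.parse is the more general parser.

from Bio import AlignIO
# AlignIO.read accepts a filename directly, so no file handle is left open.
alignment = AlignIO.read('protein1.fasta', 'fasta')
# record.seq is a Seq object; str() converts it to a plain string.
sequences = [str(record.seq) for record in alignment]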

Edit: Actually, what I do most often, when my sequences contain no line breaks, is something like:

from itertools import zip_longest  # named izip_longest in Python 2
sequences = []
with open('protein1.fasta', 'r') as fin:
    # Each iteration consumes two lines: a header and the sequence line after it.
    for header, seq in zip_longest(*[fin]*2, fillvalue=''):
        sequences.append(seq.strip())

The important part is zip_longest(*[fin]*2), which zips the file iterator fin with itself ([fin]*2 == [fin, fin]). Because a) both entries are the same iterator object and b) every call to next() advances it, you can think of each step of the zip as

yield (next(fin), next(fin))

which yields two lines at a time. That fits FASTA files whose sequences have no line breaks. The fillvalue='' guards against a file with an odd number of lines, and strip() removes the trailing newline from each sequence.
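
To see the pairing behaviour in isolation, here is a small sketch in Python 3 that applies the same zip idiom to a plain list iterator instead of a file. The sample lines are made up for illustration.

```python
# Hypothetical FASTA content with one sequence line per record.
lines = [">seq1\n", "MKVLAAGT\n", ">seq2\n", "GATTACA\n"]
it = iter(lines)  # stands in for the open file object

# [it] * 2 is a list holding the same iterator twice, so zip pulls
# two consecutive items from it for every tuple it builds.
pairs = list(zip(*[it] * 2))

# Keep only the second element of each pair and drop the newline.
sequences = [seq.strip() for header, seq in pairs]
print(sequences)
```

Plain zip stops at the shorter input, so a trailing header without a sequence line would be silently dropped here; zip_longest keeps it and fills in the missing value.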
