Reputation: 326
I wrote an ugly code which removes the fasta header and creates a variable with the protein sequence as a string. How could I do this more efficient? Is there a good way how to do this in biopython?
f = open('protein1.fasta', 'r')
raw_samples = f.readlines()
f.close()
samples = ''
for elem in raw_samples:
if elem[0] == '>':
raw_samples = elem[1:].rstrip()
else:
samples += elem.rstrip()
print samples
Upvotes: 0
Views: 3564
Reputation: 1217
An example using biopython and a dictionary to store the variable, in case accessing via a dictionary is advantageous in your situation:
dct = {}
for seq_record in SeqIO.parse(open(infile.fasta), "fasta"):
try:
dct['samples'].append(str(seq_record.seq).upper())
except:
dct['samples'] = str(seq_record.seq).upper()
Many tools introduce line-wrapping in fasta files, so for robustness, I would use biopython to import the file. @wflynny biopython + list comprehension solution is probably more efficient, but I would use the dictionary method if you want to have multiple variables, each linked to a sequence(s)
Upvotes: 0
Reputation: 18521
You want to do something like
sequences = []
with open('protein1.fasta', 'r') as fin:
sequence = ''
for line in fin:
if line.startswith('>'):
sequences.append(sequence)
sequence = ''
else:
sequence += line.strip()
With biopython, you could do
from Bio import AlignIO
alignment = AlignIO.read(open('protein1.fasta'), 'fasta')
sequences = [record.seq for record in alignment]
Edit: Actually what I've been doing most often, when my sequences have no linebreaks in them, is something like:
from itertools import izip_longest
sequences = []
with open('protein1.fasta', 'r') as fin:
for header, seq in izip_longest(*[fin]*2):
sequences.append(seq)
The important thing here is the zip(*[fin]*2)
which zips the file iterator fin
with itself ([fin]*2 == [fin, fin]
). Due to a.) the way the file iterators work and b.) that we're zipping it with itself, you can think of the zip
operation as
yield (fin.next(), fin.next())
which yield two lines at a time, which fits nicely with fasta files where sequences don't have line breaks.
Upvotes: 2