Reputation: 2214
I have a fasta file as follows:
>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW
I want to read it into a dictionary in a way that multiple lines belonging to one sequence go under one key; the output would be:
{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}
The script I have written is:
import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
        fasta_dict[header] = sequence
    return fasta_dict

fastadict = readfasta(fastaseq)
print fastadict
It works correctly and fast for a file like this, but when the file size increases (to about 1.5 GB), it becomes too slow. The step that takes the time is the concatenation of the sequence. Is there a faster way of concatenating the lines into a single string?
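To make the slowdown concrete, here is a minimal timing sketch with synthetic data (placeholder names, not the real file). Note that the dict assignment keeps a second reference to the growing string, so every + has to copy everything built so far:

import timeit

# Synthetic sequence lines standing in for one large FASTA record.
lines = ['ACGT' * 20] * 10000

def concat_plus():
    d = {}
    sequence = ''
    for ln in lines:
        sequence = sequence + ln  # copies the whole string built so far
        d['scaf'] = sequence      # mirrors fasta_dict[header] = sequence above
    return d

print(timeit.timeit(concat_plus, number=3))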
Upvotes: 1
Views: 800
Reputation: 18007
Concatenating strings with + requires creating a new string each time, since Python strings are immutable, which is time-consuming. Use str.join to concatenate them once all the lines have been collected:
import sys

def read_fasta(filename):
    fasta_dict = {}
    l = list()
    header = None
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('>'):  # a new record
                # save the previous record to the dict
                if header:
                    fasta_dict[header] = ''.join(l)
                    del l[:]  # empty the list
                header = line.strip().split('>')[1]
            else:
                l.append(line.strip())
    # save the last record (the guard also handles an empty input file)
    if header:
        fasta_dict[header] = ''.join(l)
    return fasta_dict

fastadict = read_fasta(sys.argv[1])
print(fastadict)
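As a counterpart to the timing sketch in the question, the same synthetic placeholder data with the list-and-join pattern: appending is O(1) amortized per line and the join is a single pass, so the whole loop stays linear:

import timeit

# The same synthetic sequence lines as in the question's sketch.
lines = ['ACGT' * 20] * 10000

def concat_join():
    parts = []
    for ln in lines:
        parts.append(ln)            # no copying of earlier lines
    return {'scaf': ''.join(parts)}  # one allocation over all lines

print(timeit.timeit(concat_join, number=3))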
Upvotes: 5