Homap

Reputation: 2214

Concatenating lines into a string in Python

I have a FASTA file as follows:

>scaf1
AAAAAATGTGTGTGTGTGTGYAA
AAAAACACGTGTGTGTG
>scaf2
ACGTGTGTGTGATGTGGY
AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK
>scaf3
AAAGTGTGTTGTGAAACACACYAAW

I want to read it into a dictionary in such a way that multiple lines belonging to one sequence go under one key. The output would be:

{'scaf1': 'AAAAAATGTGTGTGTGTGTGYAAAAAAACACGTGTGTGTG', 'scaf2': 'ACGTGTGTGTGATGTGGYAAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK', 'scaf3': 'AAAGTGTGTTGTGAAACACACYAAW'}

The script I have written is:

import sys
from collections import defaultdict

fastaseq = open(sys.argv[1], "r")

def readfasta(fastaseq):
    fasta_dict = {}
    for line in fastaseq:
        if line.startswith('>'):
            header = line.strip('\n')[1:]
            sequence = ''
        else:
            sequence = sequence + line.strip('\n')
        fasta_dict[header] = sequence 
    return fasta_dict

fastadict = readfasta(fastaseq)
print fastadict

It works correctly and quickly for a small file like this, but when the file size increases (to about 1.5 Gb) it becomes too slow. The step that takes the time is the concatenation of the sequence. Is there a faster way to concatenate the lines into a single string?

Upvotes: 1

Views: 800

Answers (1)

SparkAndShine

Reputation: 18007

Concatenating strings with + has to create a new string each time, since Python strings are immutable, and that repeated copying is time-consuming.
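The two patterns can be sketched side by side. This is only an illustration: the function names `concat_with_plus` and `concat_with_join` and the sample data are made up for this example, and the actual speed difference depends on the interpreter and on the input size.

```python
def concat_with_plus(parts):
    """Grow the result one piece at a time with +."""
    result = ''
    for part in parts:
        # Strings are immutable, so this may build a new string
        # and copy everything accumulated so far on each pass.
        result = result + part
    return result


def concat_with_join(parts):
    """Collect first, then build the final string in one call."""
    return ''.join(parts)


# Hypothetical input: many short sequence lines.
parts = ['ACGTGTGTGT'] * 1000

plus_result = concat_with_plus(parts)
join_result = concat_with_join(parts)
```

Both functions return the same string; they differ only in how much intermediate copying may happen along the way.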

Instead, collect the lines in a list and use str.join to concatenate them once all the strings of a record are ready:

import sys

def read_fasta(filename):
    fasta_dict = {}
    chunks = []     # sequence lines of the current record
    header = None
    with open(filename, 'r') as f:
        for line in f:
            if line.startswith('>'): # a new record
                # save the previous record to the dict
                if header is not None:
                    fasta_dict[header] = ''.join(chunks)
                    del chunks[:]    # empty the list

                # drop only the leading '>' so a '>' inside the header is kept
                header = line[1:].strip()
            else:
                chunks.append(line.strip())

        # save the last record (skipped if the file contains no header)
        if header is not None:
            fasta_dict[header] = ''.join(chunks)

    return fasta_dict

fastadict = read_fasta(sys.argv[1])
print(fastadict)
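To check the approach against the sample from the question without creating a file, here is a minimal sketch that applies the same list-and-join logic to any iterable of lines. The helper name `parse_fasta_lines` is hypothetical, and the code assumes Python 3 (where `io.StringIO` accepts `str`).

```python
import io


def parse_fasta_lines(lines):
    """Build {header: sequence} from an iterable of FASTA lines."""
    records = {}
    header = None
    chunks = []  # sequence lines of the current record
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records[header] = ''.join(chunks)
            header = line[1:]
            chunks = []
        elif line and header is not None:
            # Skip blank lines and anything before the first header.
            chunks.append(line)
    # Store the final record.
    if header is not None:
        records[header] = ''.join(chunks)
    return records


# The sample file from the question, held in memory.
sample = io.StringIO(
    ">scaf1\n"
    "AAAAAATGTGTGTGTGTGTGYAA\n"
    "AAAAACACGTGTGTGTG\n"
    ">scaf2\n"
    "ACGTGTGTGTGATGTGGY\n"
    "AAAAAATGTGNNNNNNNNYACGTGTGTGTGTGTGTACACWSK\n"
    ">scaf3\n"
    "AAAGTGTGTTGTGAAACACACYAAW\n"
)

result = parse_fasta_lines(sample)
print(result)
```

Because the function takes an iterable rather than a filename, an open file object can be passed in directly, so the file is still read line by line.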

Upvotes: 5
