Reputation: 980

Aligning DNA sequences inside python

I have thousands of DNA sequences ranged between 100 to 5000 bp and I need to align and calculate the identity score for specified pairs. Biopython pairwise2 does a nice job but only for short sequences and when the sequence size get bigger than 2kb it shows severe memory leakage which leads to 'MemoryError', even when 'score_only' and 'one_alignment_only' options are used!!

whole_coding_scores={}
from Bio import pairwise2
for genes in whole_coding: # whole coding is a <25Mb dict providing DNA sequences
   alignment=pairwise2.align.globalxx(whole_coding[genes][3],whole_coding[genes][4],score_only=True,one_alignment_only=True)
   whole_coding_scores[genes]=alignment/min(len(whole_coding[genes][3]),len(whole_coding[genes][4]))

Result returned from supercomputer:

Max vmem         = 256.114G  #Memory usage of the script
failed assumedly after job because:
job 4945543.1 died through signal XCPU (24)

I know there are other tools for alignment, but they mainly can just write the score in output file which need to be read and parsed again for retrieving and using the alignment scores. Are there any tool which can align the sequences and return the alignment score inside python environment as pairwise2 does but without memory leakage?

Upvotes: 5

Answers (3)

Markus

Reputation: 405

Biopython can (now). The pairwise2 module in Biopython ver. 1.68 is (much) faster and can take longer sequences. Here is a comparison of the new and the old pairwise2 (on 32bit Python 2.7.11 with has a 2 GByte memory limit, 64bit Win7, Intel Core i5, 2.8 GHz):

from Bio import pairwise2

length_of_sequence = ...
seq1 = '...'[:length_of_sequence]  # Two very long DNA sequences, e.g. human and green monkey
seq2 = '...'[:length_of_sequence]  # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp)
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)

Old pairwise2
- max length/time: ~1,900 chars/10 sec
New pairwise2
- max length/time: ~7450 chars/12 sec
- time for 1,900 chars: 1 sec

With score_only set to True, the new pairwise2 can do two sequences of ~8400 chars in 6 sec.

Upvotes: 1

Ward

Reputation: 2852

First, I used BioPython's needle for that. A nice howto (ignore the legacy design :-) ) can be found here

Second: maybe you can avoid getting the whole set into memory by using a generator? I do not know where your 'whole_coding' object is coming from. But, if it is a file, make sure you do not read the whole file, and then iterate over the memory object. For example:

whole_coding = open('big_file', 'rt').readlines() # Will consume memory

but

for gene in open('big_file', 'rt'):     # will not read the whole thing into memory first
    process(gene)

If your needs processing, you could write a generator funtion:

def gene_yielder(filename):
    for line in open('filename', 'rt'):
        line.strip()   # Here you preprocess your data
        yield line     # This will return

then

for gene in  gene_yielder('big_file'):
    process_gene(gene)

Basically, you want your program to act as a pipe: things flow through it, and get processed. Do not use it as cooking pot when preparing bouillon: adding everything, and apply heat. I hope this comparison is not to far fetched :-)

Upvotes: 4

jpalmer

Reputation: 197

For global alignment could try NWalign https://pypi.python.org/pypi/nwalign/. I have not used it but seems like you can recover the alignment score within your script.

Otherwise perhaps the EMBOSS tools could help: http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

Upvotes: 2

Aligning DNA sequences inside python

Answers (3)

Related Questions