Reputation: 980
I have thousands of DNA sequences ranged between 100 to 5000 bp and I need to align and calculate the identity score for specified pairs. Biopython pairwise2 does a nice job but only for short sequences and when the sequence size get bigger than 2kb it shows severe memory leakage which leads to 'MemoryError', even when 'score_only' and 'one_alignment_only' options are used!!
whole_coding_scores={}
from Bio import pairwise2
for genes in whole_coding: # whole coding is a <25Mb dict providing DNA sequences
alignment=pairwise2.align.globalxx(whole_coding[genes][3],whole_coding[genes][4],score_only=True,one_alignment_only=True)
whole_coding_scores[genes]=alignment/min(len(whole_coding[genes][3]),len(whole_coding[genes][4]))
Result returned from supercomputer:
Max vmem = 256.114G #Memory usage of the script
failed assumedly after job because:
job 4945543.1 died through signal XCPU (24)
I know there are other tools for alignment, but they mainly can just write the score in output file which need to be read and parsed again for retrieving and using the alignment scores. Are there any tool which can align the sequences and return the alignment score inside python environment as pairwise2 does but without memory leakage?
Upvotes: 5
Views: 3243
Reputation: 405
Biopython can (now). The pairwise2
module in Biopython ver. 1.68 is (much) faster and can take longer sequences.
Here is a comparison of the new and the old pairwise2 (on 32bit Python 2.7.11 with has a 2 GByte memory limit, 64bit Win7, Intel Core i5, 2.8 GHz):
from Bio import pairwise2
length_of_sequence = ...
seq1 = '...'[:length_of_sequence] # Two very long DNA sequences, e.g. human and green monkey
seq2 = '...'[:length_of_sequence] # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp)
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)
Old pairwise2
New pairwise2
With score_only
set to True
, the new pairwise2 can do two sequences of ~8400 chars in 6 sec.
Upvotes: 1
Reputation: 2852
First, I used BioPython's needle for that. A nice howto (ignore the legacy design :-) ) can be found here
Second: maybe you can avoid getting the whole set into memory by using a generator? I do not know where your 'whole_coding' object is coming from. But, if it is a file, make sure you do not read the whole file, and then iterate over the memory object. For example:
whole_coding = open('big_file', 'rt').readlines() # Will consume memory
but
for gene in open('big_file', 'rt'): # will not read the whole thing into memory first
process(gene)
If your needs processing, you could write a generator funtion:
def gene_yielder(filename):
for line in open('filename', 'rt'):
line.strip() # Here you preprocess your data
yield line # This will return
then
for gene in gene_yielder('big_file'):
process_gene(gene)
Basically, you want your program to act as a pipe: things flow through it, and get processed. Do not use it as cooking pot when preparing bouillon: adding everything, and apply heat. I hope this comparison is not to far fetched :-)
Upvotes: 4
Reputation: 197
For global alignment could try NWalign https://pypi.python.org/pypi/nwalign/. I have not used it but seems like you can recover the alignment score within your script.
Otherwise perhaps the EMBOSS tools could help: http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html
Upvotes: 2