Beginner
Beginner

Reputation: 2886

Biopython Global Alignment : Out of Memory

Im trying the global alignment method from the Biopython module. Using it on short sequences is easy and gives an alignment matrix straightaway. However I really need to run it on larger sequences I have (an average lenght of 2000 nucleatides (or) characters). However I keep running into the Out of Memory error. I looked on SO and found this previous question. The answers provided are not helpfull as they link to the this same website which cant be accessed now.Apart from this I have tried these steps:

  1. I tried using a 64-bit python since my personal computer has 4gb RAM.
  2. sshed to a small school server with 16gb RAM and tried running on that. Its still running after close to 4 hours.

Since its is a small script im unsure how to modify it. ANy help will be greatly appreciated.

My script:

import os
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

file_list = []

file_list = [each for each in os.listdir(os.getcwd()) if each.endswith(".dna")]

align_file = open("seq_align.aln","w")

seq_list = []

for each_file in file_list:
    f_o = open(each_file,"r")
    seq_list.append(f_o.read())

for a in pairwise2.align.globalmx(seq_list[0],seq_list[1]):
    align_file.write(format_alignment(*a))

align_file.close()

Upvotes: 2

Views: 818

Answers (1)

Beginner
Beginner

Reputation: 2886

So the school server finally completed the task. What I realized was that for each alignment there were 1000 matrices constructed and calculated. The align.globalxx method has a variable MAX_ALIGNMENT which is by default set to 1000. Changing it via monkey patching dint really change anything . The documentation says that, the method tries all possible alignments (yes 1000) but in my case all the matrices have the same alignment score (as well as few test sequences I tried). Finally a small piece of comment in the documentation states that if you need only 1 score, use the optional parameter one_alignment_only which accepts a boolean value only. All i did was this:

for a in pairwise2.align.globalmx(seq_list[0],seq_list[1],one_alignment_only=True):
    align_file.write(format_alignment(*a))

This reduced the time considerably. However my PC still crashed so I assume this is a very memory intensive task and requires much more RAM (16gb on a small server). So probably a more efficient way of reading the sequences in the matrix should be thought of.

Upvotes: 3

Related Questions