Numpy and Biopython must be integrated?

Question

For example... I have two scripts for look if a (Multiple Sequence Alignment) MSA has more than 50 columns with less than 50% of gaps.

The first using BioPython takes 4.2 seconds in a MSA of 16281 sequences with 609 columns (PF00085 of Pfam in fasta format). [ getitem method of the Multiple Sequence Alignment object of Biopython consumes a lot of time ]

The second using a simple IO for generate a 2D Numpy Array with the MSA, takes only 1.2 second in the same Alignment.

I think that a Numpy approach to MSA objects can be more useful and faster. For example, you can use a boolean numpy array to select specific rows and columns. Actually deletion and selection of columns (for example for eliminates columns with more tha 50% of gaps) is very time consuming and not well implemented in Biopython. I think this can be useful to for a nx3 numpy array for PDB coordinates, too.

I have five ideas, maybe only one or two of them can be useful:

1 - Create a Seq and Multiple Sequence Alignment object (Bio.Align.MultipleSeqAlignment) based on numpy instead of str. This can be a problem for compatibility... Maybe it is NOT a good idea. I dont't know.

2 - Create a faster method in Biopython for obtain numpy arrays versions from Biopython objects. I try to generate numpy arrays for Multiple Sequence Alignment object, but this do multiple calls to getitem method, and it's more time consuming than using Biopython alone. But, maybe someone with more programming skills can do something better.

3 - Create a module for numpy or scipy with IO support for Alignments and PDB. Maybe the more simple and useful idea.

4 - Create another complete Bio module but based on numpy. Maybe inside of scipy or numpy.

5 - Like ideas 2 and 3, creating modules and methods for faster and efficient compatibility between Biopython and numpy objects.

What do you think? What of the ideas is better? Do you have some better idea? Can be possible do something? I want to colaborate with Biopython project... I think than integration with numpy can be a good start.

A lot of thanks ;)

P.D.: My two scripts... The slow, based on Biopython:

#!/usr/bin/python2.7

from sys import argv
from Bio import AlignIO
aln = AlignIO.read(open(argv[1],"r"), "fasta")
longitud = aln.get_alignment_length()
if longitud > 150:
    corte = 0.5 * len(aln)
    j = 0
    i = 0
    while j<50 and i=50:
        print argv[1]

And the fastest based on numpy array:

#!/usr/bin/python2.7

from sys import argv
import numpy as np

with open(argv[1],'r') as archivo:
    secuencias=[]
    identificadores=[]
    temp=[]
    for linea in archivo:
        if linea[0]=='>':
            identificadores.append(linea[1:].replace('
',''))
            secuencias.append(list(temp))
            temp=""
        else:
            temp += linea.replace('
','')
    secuencias.append(list(temp))

sec = np.array(secuencias[1:])
ide = np.array(identificadores)

if len(ide)>150:
    corte = len(ide) * 0.5
    if np.sum(np.sum(sec=='-',1) < corte) >= 50:
        print argv[1]

Peter Cock · Accepted Answer

If you are going to be doing lots of operations on MSA objects where it is useful to treat them as arrays of characters, then I would just use Biopython's AlignIO to load the alignment and then convert it into a NumPy array of characters. For example:

import numpy as nump
from Bio import AlignIO
filename = "opuntia.aln"
format = "clustal"
alignment = AlignIO.read(filename, format)
align_array = numpy.array([list(rec) for rec in alignment], numpy.character)

That quick example could easily be added to the alignment object as a to_array method, or included in the tutorial. It is helpful?

Granted you are still paying the overhead of all the object creation (Seq objects, SeqRecord objects, empty annotation dictionaries, the alignment object etc) but that is the downside to the AlignIO interface - it works on a relatively heavy object model. This isn't really needed on simple formats like FASTA and Clustal, but is more useful with rich alignment formats like Stockholm.

Numpy and Biopython must be integrated?

I have five ideas, maybe only one or two of them can be useful:

Answers (1)

Related Questions