Quant Metropolis

Reputation: 2672

numpy indexing with multiple arrays

Given two sequences of data (of equal length) and quality values for each data point, I want to calculate a similarity score based upon a given scoring matrix.

What is the most efficient way to vectorize the following loop:

score = 0
for i in xrange(len(seq1)):
    score += similarity[seq1[i], seq2[i], qual1[i], qual2[i]]

similarity is a 4-dimensional float array with shape (32, 32, 100, 100); seq1, seq2, qual1 and qual2 are 1-dimensional int arrays of equal length (on the order of 1,000 to 40,000 elements).

Upvotes: 0

Views: 124

Answers (2)

DSM

Reputation: 353049

Shouldn't this Just Work(tm)?

>>> score = 0
>>> for i in xrange(len(seq1)):
...     score += similarity[seq1[i], seq2[i], qual1[i], qual2[i]]
...
>>> score
498.71792400493433
>>> similarity[seq1,seq2, qual1, qual2].sum()
498.71792400493433

Code:

import numpy as np

similarity = np.random.random((32, 32, 100, 100))
n = 1000
seq1, seq2, qual1, qual2 = [np.random.randint(0, s, n) for s in similarity.shape]

def slow():
    score = 0
    for i in xrange(len(seq1)):
        score += similarity[seq1[i], seq2[i], qual1[i], qual2[i]]
    return score

def fast():
    return similarity[seq1, seq2, qual1, qual2].sum()

gives:

>>> timeit slow()
100 loops, best of 3: 3.59 ms per loop
>>> timeit fast()
10000 loops, best of 3: 143 us per loop
>>> np.allclose(slow(),fast())
True
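Why the one-liner matches the loop: when every axis is indexed with an integer array of the same length, NumPy walks the index arrays in lockstep, so element k of the result is similarity[seq1[k], seq2[k], qual1[k], qual2[k]]. The sketch below checks this on Python 3 with random data; the seed and the use of np.random.default_rng are assumptions for illustration, not part of the original timing code, and the flat-index variant is just another way to write the same lookup.

```python
import numpy as np

# Illustrative data only; the seed is arbitrary.
rng = np.random.default_rng(0)
similarity = rng.random((32, 32, 100, 100))
n = 1000
seq1, seq2, qual1, qual2 = (rng.integers(0, s, n) for s in similarity.shape)

# Integer-array indexing: the four index arrays are read in lockstep,
# producing one value per position k.
picked = similarity[seq1, seq2, qual1, qual2]

# The same lookups written as an explicit loop, for comparison.
looped = np.array([similarity[seq1[i], seq2[i], qual1[i], qual2[i]]
                   for i in range(n)])

# Equivalent formulation: convert the 4-D indices to flat offsets first.
flat = np.ravel_multi_index((seq1, seq2, qual1, qual2), similarity.shape)
picked_flat = similarity.ravel()[flat]

score = picked.sum()
```

The index arrays must all have the same length (or be broadcastable to a common shape) for this to work.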

Upvotes: 3

John Zwinck

Reputation: 249133

How about this?

score = numpy.sum(map(similarity.__getitem__, zip(seq1, seq2, qual1, qual2)))

You could also try itertools.imap and itertools.izip (Python 2) to avoid building the intermediate lists. The zip is necessary because __getitem__ takes a single tuple rather than four separate numbers; maybe that can be improved somehow by looking in a darker corner of the itertools module.
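On Python 3, map and zip are already lazy, so imap and izip are not needed, but the iterator has to be consumed explicitly before summing. A minimal sketch of the same idea, assuming random test data (the seed and the names values, score_map and score_fancy are illustrative only); it uses np.fromiter to collect the per-position lookups into an array:

```python
import numpy as np

# Illustrative data only; the seed is arbitrary.
rng = np.random.default_rng(0)
similarity = rng.random((32, 32, 100, 100))
n = 1000
seq1, seq2, qual1, qual2 = (rng.integers(0, s, n) for s in similarity.shape)

# zip yields one (i, j, k, l) tuple per position; __getitem__ accepts that
# tuple as a full 4-D index and returns a single float.
lookups = map(similarity.__getitem__, zip(seq1, seq2, qual1, qual2))

# Consume the lazy iterator into a float array, then sum it.
values = np.fromiter(lookups, dtype=float, count=n)
score_map = values.sum()

# Integer-array indexing, used here only to check the result.
score_fancy = similarity[seq1, seq2, qual1, qual2].sum()
```

This still performs one Python-level lookup per element, so it should be expected to behave more like the explicit loop than like a single indexing expression in terms of speed.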

Upvotes: 0
