slaw
slaw

Reputation: 6879

Vectorize Operations in Numpy for Two Dependent Arrays

I have an n x n numpy array that contains all pairwise distances and another 1 x n array that contains some scoring metric.

Example:

import numpy as np
import scipy.spatial.distance

dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))

array([[  0. ,   3.2,   4.1,   8.8,   0.6],
       [  3.2,   0. ,   1.5,   9. ,   5. ],
       [  4.1,   1.5,   0. ,   9.9,  10. ],
       [  8.8,   9. ,   9.9,   0. ,   1.1],
       [  0.6,   5. ,  10. ,   1.1,   0. ]])

score = np.array([19., 1.3, 4.8, 6.2, 5.7])

array([ 19. ,   1.3,   4.8,   6.2,   5.7])

So, note that the ith element of the score array corresponds to the ith row of the distance array.

What I need to do is vectorize this process:

  1. For the ith value in the score array, find all other values that are larger than the ith value and note their indices
  2. Then, in the ith row of the distance array, get all of the distances with the same indices as noted in step 1. above and return the smallest distance
  3. In the case where the ith value in the score array is the largest, then the smallest distance is set as the largest distance found in the distance array

Here is an un-vectorized version:

n = score.shape[0]
min_dist = np.full(n, np.max(dists))
for i in range(score.shape[0]):
    inx = numpy.where(score > score[i])
    if len(inx[0]) > 0:
        min_dist[i] = np.min(dists[i, inx])

min_dist

array([ 10. ,   1.5,   4.1,   8.8,   0.6])

This works but is pretty inefficient in terms of speed and my arrays are expected to be much, much larger. I am hoping to improve the efficiency by using faster vectorized operations to achieve the same result.

Update: Based on Oliver W.'s answer, I came up with my own that doesn't require making a copy of the distance array

def new_method (dists, score):
    mask = score > score.reshape(-1,1)
    return np.ma.masked_array(dists, mask=~mask).min(axis=1).filled(dists.max()) 

One could in theory make it a one-liner but it's already a bit challenging to read to the untrained eye.

Upvotes: 2

Views: 255

Answers (1)

Oliver W.
Oliver W.

Reputation: 13459

One possible vectorized solution is given below.

import numpy as np
import scipy.spatial.distance

dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
score = np.array([19., 1.3, 4.8, 6.2, 5.7])

def your_method(dists, score):
    dim = score.shape[0]
    min_dist = np.full(dim, np.max(dists))
    for i in range(dim):
        inx = np.where(score > score[i])
        if len(inx[0]) > 0:
            min_dist[i] = np.min(dists[i, inx])
    return min_dist

def vectorized_method_v1(dists, score):
    mask = score > score.reshape(-1,1)
    dists2 = dists.copy()  # get rid of this in case the dists array can be changed
    dists2[np.logical_not(mask)] = dists.max()
    return dists2.min(axis=1)    

The speed gain is not so impressive for these small arrays (~factor of 3 on my machine), so I'll demonstrate with a larger set:

dists = scipy.spatial.distance.squareform(np.random.random(50*99))
score = np.random.random(dists.shape[0])
print(dists.shape)
%timeit your_method(dists, score)
%timeit vectorized_method_v1(dists, score)

## -- End pasted text --
(100, 100)
100 loops, best of 3: 2.98 ms per loop
10000 loops, best of 3: 125 µs per loop

Which is close to a factor of 24.

Upvotes: 1

Related Questions