Reputation: 6879
I have an n x n numpy array that contains all pairwise distances and another 1 x n array that contains some scoring metric.
Example:
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
array([[ 0. , 3.2, 4.1, 8.8, 0.6],
[ 3.2, 0. , 1.5, 9. , 5. ],
[ 4.1, 1.5, 0. , 9.9, 10. ],
[ 8.8, 9. , 9.9, 0. , 1.1],
[ 0.6, 5. , 10. , 1.1, 0. ]])
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
array([ 19. , 1.3, 4.8, 6.2, 5.7])
So, note that the ith element of the score array corresponds to the ith row of the distance array.
What I need to do is vectorize this process:
Here is an un-vectorized version:
n = score.shape[0]
min_dist = np.full(n, np.max(dists))
for i in range(score.shape[0]):
inx = numpy.where(score > score[i])
if len(inx[0]) > 0:
min_dist[i] = np.min(dists[i, inx])
min_dist
array([ 10. , 1.5, 4.1, 8.8, 0.6])
This works but is pretty inefficient in terms of speed and my arrays are expected to be much, much larger. I am hoping to improve the efficiency by using faster vectorized operations to achieve the same result.
Update: Based on Oliver W.'s answer, I came up with my own that doesn't require making a copy of the distance array
def new_method (dists, score):
mask = score > score.reshape(-1,1)
return np.ma.masked_array(dists, mask=~mask).min(axis=1).filled(dists.max())
One could in theory make it a one-liner but it's already a bit challenging to read to the untrained eye.
Upvotes: 2
Views: 255
Reputation: 13459
One possible vectorized solution is given below.
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
def your_method(dists, score):
dim = score.shape[0]
min_dist = np.full(dim, np.max(dists))
for i in range(dim):
inx = np.where(score > score[i])
if len(inx[0]) > 0:
min_dist[i] = np.min(dists[i, inx])
return min_dist
def vectorized_method_v1(dists, score):
mask = score > score.reshape(-1,1)
dists2 = dists.copy() # get rid of this in case the dists array can be changed
dists2[np.logical_not(mask)] = dists.max()
return dists2.min(axis=1)
The speed gain is not so impressive for these small arrays (~factor of 3 on my machine), so I'll demonstrate with a larger set:
dists = scipy.spatial.distance.squareform(np.random.random(50*99))
score = np.random.random(dists.shape[0])
print(dists.shape)
%timeit your_method(dists, score)
%timeit vectorized_method_v1(dists, score)
## -- End pasted text --
(100, 100)
100 loops, best of 3: 2.98 ms per loop
10000 loops, best of 3: 125 µs per loop
Which is close to a factor of 24.
Upvotes: 1