Reputation: 33
I'm working with two different arrays (75x4), and I'm applying a shortest distance algorithm between the two arrays.
So I want to:
How would I go about doing this with numpy?
Essentially I want to perform an operation between one row of array 1 on every row of array 2, find the minimum value, and store that in a new array. Then do that very same thing for the 2nd row of array 1, and so on for all 75 rows of array 1.
Here is the code for the formula I'm using. What I get here is just the distance between every row of array 1 (training data) and array 2 (testing data). But what I'm looking for is to do it for one row of array 1 iterating down for all rows of array 2, storing the minimum value in a new array, then doing the same for the next row of array 1, and so on.
arr_attributedifference = (arr_trainingdata - arr_testingdata)**2
arr_distance = np.sqrt(arr_attributedifference.sum(axis=1))
Upvotes: 1
Views: 102
Reputation: 53099
Here are two methods one using einsum
, the other KDTree
:
einsum
does essentially what we could also achieve via broadcasting, for example np.einsum('ik,jk', A, B)
is roughly equivalent to (A[:, None, :] * B[None, :, :]).sum(axis=2)
. The advantage of einsum is that it does the summing straight away, so it avoids creating an mxmxn intermediate array.
KDTree
is more sophisticated. We have to invest upfront into generating the tree but afterwards querying nearest neighbors is very efficient.
import numpy as np
from scipy.spatial import cKDTree as KDTree
def f_einsum(A, B):
B2AB = np.einsum('ij,ij->i', B, B) / 2 - np.einsum('ik,jk', A, B)
idx = B2AB.argmin(axis=1)
D = A-B[idx]
return np.sqrt(np.einsum('ij,ij->i', D, D)), idx
def f_KDTree(A, B):
T = KDTree(B)
return T.query(A, 1)
m, n = 75, 4
A, B = np.random.randn(2, m, n)
de, ie = f_einsum(A, B)
dt, it = f_KDTree(A, B)
assert np.all(ie == it) and np.allclose(de, dt)
from timeit import timeit
for m, n in [(75, 4), (500, 4)]:
A, B = np.random.randn(2, m, n)
print(m, n)
print('einsum:', timeit("f_einsum(A, B)", globals=globals(), number=1000))
print('KDTree:', timeit("f_KDTree(A, B)", globals=globals(), number=1000))
Sample run:
75 4
einsum: 0.067826496087946
KDTree: 0.12196151306852698
500 4
einsum: 3.1056990439537913
KDTree: 0.85108971898444
We can see that at small problem size the direct method (einsum) is faster while for larger problem size KDTree wins.
Upvotes: 3