RS_0095

Reputation: 33

Iterating operation with two arrays using numpy

I'm working with two different arrays (75x4), and I'm applying a shortest distance algorithm between the two arrays.

How would I go about doing this with NumPy?

Essentially I want to perform an operation between one row of array 1 and every row of array 2, find the minimum value, and store it in a new array. Then do the very same thing for the 2nd row of array 1, and so on for all 75 rows of array 1.

Here is the code for the formula I'm using. What it gives me is just the row-by-row distance between array 1 (training data) and array 2 (testing data). What I'm looking for instead is to take one row of array 1, compute its distance to every row of array 2, store the minimum in a new array, and then repeat for the next row of array 1, and so on.

arr_attributedifference = (arr_trainingdata - arr_testingdata)**2
arr_distance = np.sqrt(arr_attributedifference.sum(axis=1))
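In other words, the intent is something like this plain loop (slow, just to show what's wanted; the 75x4 arrays here are random stand-ins for my data):

```python
import numpy as np

arr_trainingdata = np.random.randn(75, 4)
arr_testingdata = np.random.randn(75, 4)

# For each row of array 1, compute its distance to every row of array 2
# and keep only the minimum.
arr_mindistances = np.empty(len(arr_trainingdata))
for i, row in enumerate(arr_trainingdata):
    arr_distance = np.sqrt(((row - arr_testingdata)**2).sum(axis=1))
    arr_mindistances[i] = arr_distance.min()
```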

Upvotes: 1

Views: 102

Answers (1)

Paul Panzer

Reputation: 53099

Here are two methods, one using einsum, the other KDTree:

einsum does essentially what we could also achieve via broadcasting, for example np.einsum('ik,jk', A, B) is roughly equivalent to (A[:, None, :] * B[None, :, :]).sum(axis=2). The advantage of einsum is that it does the summing straight away, so it avoids creating an mxmxn intermediate array.
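A quick sanity check of that equivalence (shapes chosen arbitrarily for illustration):

```python
import numpy as np

A = np.random.randn(5, 3)
B = np.random.randn(7, 3)

# einsum sums over the shared last axis directly...
via_einsum = np.einsum('ik,jk', A, B)
# ...whereas broadcasting first materializes a 5x7x3 intermediate.
via_broadcast = (A[:, None, :] * B[None, :, :]).sum(axis=2)

assert np.allclose(via_einsum, via_broadcast)
```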

KDTree is more sophisticated. We have to invest upfront into generating the tree but afterwards querying nearest neighbors is very efficient.

import numpy as np
from scipy.spatial import cKDTree as KDTree

def f_einsum(A, B):
    # For each row of A, the nearest row of B minimizes
    # ||A_i - B_j||^2 = ||A_i||^2 - 2 A_i.B_j + ||B_j||^2; the ||A_i||^2
    # term is constant per row, so it suffices to minimize
    # ||B_j||^2 / 2 - A_i.B_j.
    B2AB = np.einsum('ij,ij->i', B, B) / 2 - np.einsum('ik,jk', A, B)
    idx = B2AB.argmin(axis=1)  # index of the nearest row of B for each row of A
    D = A - B[idx]
    return np.sqrt(np.einsum('ij,ij->i', D, D)), idx  # (distances, indices)

def f_KDTree(A, B):
    # Build the tree over B once, then query the single nearest neighbor
    # of each row of A; returns (distances, indices) like f_einsum.
    T = KDTree(B)
    return T.query(A, 1)

m, n = 75, 4
A, B = np.random.randn(2, m, n)

de, ie = f_einsum(A, B)
dt, it = f_KDTree(A, B)
assert np.all(ie == it) and np.allclose(de, dt)

from timeit import timeit

for m, n in [(75, 4), (500, 4)]:
    A, B = np.random.randn(2, m, n)
    print(m, n)
    print('einsum:', timeit("f_einsum(A, B)", globals=globals(), number=1000))
    print('KDTree:', timeit("f_KDTree(A, B)", globals=globals(), number=1000))

Sample run:

75 4
einsum: 0.067826496087946
KDTree: 0.12196151306852698
500 4
einsum: 3.1056990439537913
KDTree: 0.85108971898444

We can see that for small problem sizes the direct method (einsum) is faster, while for larger problem sizes KDTree wins.

Upvotes: 3
