Reputation: 2061
I've got a data set that looks like the following:
[column 1] [column 2] [column 3] [column 4] [column 5]
[row 1] (some value)
[row 2]
[row 3]
...
[row 700 000]
and a second data set that looks exactly the same, but with only about 4 rows.
What I would like to do is calculate the Euclidean distance between each row of data set 1 and each of the 4 rows of data set 2, and keep the minimum of those 4 distances.
This is then repeated for the rest of the 700,000 rows of data. I know it's not advisable to iterate through NumPy arrays, so is there a way to calculate, for each row of data set 1, the minimum distance to the 4 rows of data set 2?
Apologies if this is confusing, but my main point is that I do not wish to iterate through the array and I'm trying to find a better way to tackle this problem.
In the end, I should get back a 700,000-row by 1-column array containing, for each row of data set 1, the best (lowest) of the 4 distances to data set 2.
import numpy as np

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
b = np.array([[1, 1, 1, 1]])

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis=1), 0.5)

print(euc_distance(a, b))
# this prints out [0. 2. 4.]
However, when I tried to input more than one row in b,

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
b = np.array([[1, 1, 1, 1], [2, 2, 2, 2]])

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis=1), 0.5)

print(euc_distance(a, b))
# this throws an error, as shapes (3, 4) and (2, 4) cannot be broadcast together
I am looking for a way to get back a sort of 2D array of all the pairwise distances, i.e. [[euc_dist([1,1,1,1],[1,1,1,1]), euc_dist([1,1,1,1],[2,2,2,2])] , ... ]
Upvotes: 1
Views: 1672
Reputation: 36839
You can use broadcasting for this:
a = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3]
])
b = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 2]
])

def euc_distance(array1, array2):
    return np.sqrt(np.sum((array1 - array2)**2, axis=-1))

print(euc_distance(a[None, :, :], b[:, None, :]))
# [[0. 2. 4.]
#  [2. 0. 2.]]
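From that pairwise matrix, the single column of minimum distances asked for in the question is just a reduction over the b-axis. A sketch using the same arrays as above:

```python
import numpy as np

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
b = np.array([[1, 1, 1, 1], [2, 2, 2, 2]])

def euc_distance(array1, array2):
    return np.sqrt(np.sum((array1 - array2)**2, axis=-1))

# pairwise distances, shape (len(b), len(a)) == (2, 3)
dist = euc_distance(a[None, :, :], b[:, None, :])

# minimum over the rows of b -> one value per row of a
min_dist = dist.min(axis=0)
print(min_dist)  # [0. 0. 2.]
```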
Comparing the times for a dataset of your size:
a = np.random.rand(700000, 4)
b = np.random.rand(4, 4)
c = euc_distance(a[None, :, :], b[:, None, :])
d = np.array([euc_distance(a, val) for val in b])
e = np.array([euc_distance(val, b) for val in a]).T
np.allclose(c, d)
# True
np.allclose(d, e)
# True
%timeit euc_distance(a[None, :, :], b[:, None, :])
# 113 ms ± 4.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.array([euc_distance(a, val) for val in b])
# 115 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.array([euc_distance(val, b) for val in a])
# 7.03 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
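As an aside, if SciPy is available, `scipy.spatial.distance.cdist` computes the same pairwise matrix for you without writing the broadcasting by hand. A sketch, assuming SciPy is installed:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.random.rand(700000, 4)
b = np.random.rand(4, 4)

# cdist returns shape (len(a), len(b)); the min over axis 1
# is the closest-of-4 distance for every row of a
min_dist = cdist(a, b).min(axis=1)
print(min_dist.shape)  # (700000,)
```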
Upvotes: 1
Reputation: 125
I couldn't test it, but this should get you there, assuming normalised positive data: np.argmax(np.matmul(a, b.T), axis=1)
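The reason the dot-product trick works: for unit-length rows, ||x - y||² = 2 - 2·x·y, so the nearest row of b is the one with the largest dot product. A quick check of that claim on hypothetical random data, with rows normalised first:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 4))
b = rng.random((4, 4))

# normalise each row to unit length
a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
b_n = b / np.linalg.norm(b, axis=1, keepdims=True)

# index of the nearest row of b, via the dot product...
by_dot = np.argmax(a_n @ b_n.T, axis=1)

# ...and via explicit Euclidean distances
dist = np.sqrt(((a_n[:, None, :] - b_n[None, :, :]) ** 2).sum(axis=-1))
by_dist = np.argmin(dist, axis=1)

print(np.array_equal(by_dot, by_dist))  # True
```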
A little elaboration on my previous post. If performance is still an issue, instead of your approach you can use this:
b = np.tile(b, (a.shape[0], 1, 1))
a = np.tile(a, (1, 1, b.shape[1])).reshape(b.shape)
absolute_dist = np.sqrt(np.sum(np.square(a - b), axis=2))
It produces exactly the same result but, on 600,000 rows, runs about 20 times faster than the list comprehension.
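To finish the tile approach off with the minimum the question asks for, reduce over the second axis. A sketch on the small example arrays (using new names a_t/b_t so the originals aren't overwritten):

```python
import numpy as np

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], dtype=float)
b = np.array([[1, 1, 1, 1], [2, 2, 2, 2]], dtype=float)

# replicate b once per row of a, and each row of a once per row of b
b_t = np.tile(b, (a.shape[0], 1, 1))                       # (3, 2, 4)
a_t = np.tile(a, (1, 1, b_t.shape[1])).reshape(b_t.shape)  # (3, 2, 4)

absolute_dist = np.sqrt(np.sum(np.square(a_t - b_t), axis=2))  # (3, 2)
print(absolute_dist.min(axis=1))  # [0. 0. 2.]
```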
Upvotes: 1
Reputation: 2061
Thanks for everyone's help, however I think I've managed to solve my own problem by using a simple list comprehension. I was over-complicating things! By doing so, instead of iterating through each pair individually, I have essentially cut the run time by more than half.
What I did was the following:
c = np.array( [euc_distance(val, b) for val in a])
Who knew this problem could have such a simple solution!
Upvotes: 0