Reputation: 2061
I've got a data set that looks like the following:
[column 1] [column 2] [column 3] [column 4] [column 5]
[row 1] (some value)
[row 2]
[row 3]
...
[row 700 000]
and a second data set that looks exactly the same, but with only about 4 rows.
What I would like to do is calculate the Euclidean distance between each row of data set 1 and each of the 4 rows of data set 2, and keep the minimum of those 4 distances.
This is then repeated for the rest of the 700,000 rows of data. I know it's not advisable to iterate through NumPy arrays, so is there a way to calculate, for each row of data set 1, the minimum distance to the 4 rows of data set 2?
Apologies if this is confusing, but my main point is that I do not wish to iterate through the array and I'm trying to find a better way to tackle this problem.
In the end, I should get back a 700,000-row by 1-column array containing, for each row of data set 1, the best (lowest) of the 4 distances to data set 2.
import numpy as np

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
b = np.array([[1, 1, 1, 1]])

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis=1), 0.5)

print(euc_distance(a, b))
# this prints out [0. 2. 4.]
However, when I tried to input more than one row in b,

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
b = np.array([[1, 1, 1, 1], [2, 2, 2, 2]])

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis=1), 0.5)

print(euc_distance(a, b))
# this throws an error, as shapes (3, 4) and (2, 4) cannot be broadcast together
I am looking for a way to get back a sort of 2D array of all the pairwise distances, i.e. [[euc_dist([1,1,1,1],[1,1,1,1]), euc_dist([1,1,1,1],[2,2,2,2])] , ... ]
Upvotes: 1
Views: 1672
Reputation: 36839
You can use broadcasting for this:
a = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3]
])
b = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 2]
])

def euc_distance(array1, array2):
    return np.sqrt(np.sum((array1 - array2)**2, axis=-1))

print(euc_distance(a[None, :, :], b[:, None, :]))
# [[0. 2. 4.]
#  [2. 0. 2.]]
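From that pairwise matrix, the single column of minimum distances asked for in the question is just a reduction over the b-axis. A sketch using the same arrays as above:

```python
import numpy as np

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
b = np.array([[1, 1, 1, 1], [2, 2, 2, 2]])

def euc_distance(array1, array2):
    return np.sqrt(np.sum((array1 - array2)**2, axis=-1))

# pairwise distances, shape (len(b), len(a)) == (2, 3)
dist = euc_distance(a[None, :, :], b[:, None, :])

# minimum over the rows of b -> one value per row of a
min_dist = dist.min(axis=0)
print(min_dist)  # [0. 0. 2.]
```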
Comparing the times for a dataset of your size:
a = np.random.rand(700000, 4)
b = np.random.rand(4, 4)
c = euc_distance(a[None, :, :], b[:, None, :])
d = np.array([euc_distance(a, val) for val in b])
e = np.array([euc_distance(val, b) for val in a]).T
np.allclose(c, d)
# True
np.allclose(d, e)
# True
%timeit euc_distance(a[None, :, :], b[:, None, :])
# 113 ms ± 4.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.array([euc_distance(a, val) for val in b])
# 115 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.array([euc_distance(val, b) for val in a])
# 7.03 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
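As an aside, if SciPy is available, `scipy.spatial.distance.cdist` computes the same pairwise matrix for you without writing the broadcasting by hand. A sketch, assuming SciPy is installed:

```python
import numpy as np
from scipy.spatial.distance import cdist

a = np.random.rand(700000, 4)
b = np.random.rand(4, 4)

# cdist returns shape (len(a), len(b)); the min over axis 1
# is the closest-of-4 distance for every row of a
min_dist = cdist(a, b).min(axis=1)
print(min_dist.shape)  # (700000,)
```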
Upvotes: 1
Reputation: 125
I couldn't test it, but this should get you there, assuming normalised positive data: np.argmax(np.matmul(a, b.T), axis=1)
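The reason the dot-product trick works: for unit-length rows, ||x - y||² = 2 - 2·x·y, so the nearest row of b is the one with the largest dot product. A quick check of that claim on hypothetical random data, with rows normalised first:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 4))
b = rng.random((4, 4))

# normalise each row to unit length
a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
b_n = b / np.linalg.norm(b, axis=1, keepdims=True)

# index of the nearest row of b, via the dot product...
by_dot = np.argmax(a_n @ b_n.T, axis=1)

# ...and via explicit Euclidean distances
dist = np.sqrt(((a_n[:, None, :] - b_n[None, :, :]) ** 2).sum(axis=-1))
by_dist = np.argmin(dist, axis=1)

print(np.array_equal(by_dot, by_dist))  # True
```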
A little elaboration on my previous post. If performance is still an issue, instead of your approach you can use this:
b = np.tile(b, (a.shape[0], 1, 1))
a = np.tile(a, (1, 1, b.shape[1])).reshape(b.shape)
absolute_dist = np.sqrt(np.sum(np.square(a - b), axis=2))
It produces exactly the same result but, on 600,000 rows, runs about 20 times faster than the list comprehension.
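To finish the tile approach off with the minimum the question asks for, reduce over the second axis. A sketch on the small example arrays (using new names a_t/b_t so the originals aren't overwritten):

```python
import numpy as np

a = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], dtype=float)
b = np.array([[1, 1, 1, 1], [2, 2, 2, 2]], dtype=float)

# replicate b once per row of a, and each row of a once per row of b
b_t = np.tile(b, (a.shape[0], 1, 1))                       # (3, 2, 4)
a_t = np.tile(a, (1, 1, b_t.shape[1])).reshape(b_t.shape)  # (3, 2, 4)

absolute_dist = np.sqrt(np.sum(np.square(a_t - b_t), axis=2))  # (3, 2)
print(absolute_dist.min(axis=1))  # [0. 0. 2.]
```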
Upvotes: 1
Reputation: 2061
Thanks for everyone's help, however I think I've managed to solve my own problem by using a simple list comprehension. I was over-complicating things! By doing so, instead of iterating through each pair individually, I have essentially cut the run time by more than half.
What I did was the following:
c = np.array( [euc_distance(val, b) for val in a])
Who knew this problem could have such a simple solution!
Upvotes: 0