Reputation: 57
I have 40,000 points and I need to find the Euclidean distance between every pair of them. After searching online, I found that an efficient way to calculate the Euclidean distance between pairs of points is scipy.spatial.distance.cdist. But since there are 40,000 points, the distance matrix will take around 12 GB of memory.
Is there a way to reduce the memory required to store the distance matrix without compromising the speed of the calculation? Can the data type be changed to float32 instead of float64 when computing the distance matrix?
Upvotes: 1
Views: 1451
Reputation: 6482
cdist-like approach
The output dtype is the same as the input dtype, so float32 input gives a float32 distance matrix.
import numpy as np
import numba as nb

# Full distance matrix for 3-D points; the result dtype follows vec_1.
@nb.njit(fastmath=True, parallel=True)
def calc_distance(vec_1, vec_2):
    res = np.empty((vec_1.shape[0], vec_2.shape[0]), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):  # rows are distributed across threads
        for j in range(vec_2.shape[0]):
            res[i, j] = np.sqrt((vec_1[i, 0] - vec_2[j, 0])**2
                                + (vec_1[i, 1] - vec_2[j, 1])**2
                                + (vec_1[i, 2] - vec_2[j, 2])**2)
    return res
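To check the dtype behaviour on a small input without compiling anything, here is a minimal NumPy-only sketch. The name calc_distance_ref is hypothetical (it is not part of the answer above), and the broadcasting version builds an (n, m, 3) temporary, so it is only meant for small sanity checks, not for 40,000 points.

```python
import numpy as np

def calc_distance_ref(vec_1, vec_2):
    """Reference pairwise Euclidean distances for small inputs (hypothetical helper)."""
    # Broadcasting yields an (n, m, 3) array of coordinate differences.
    diff = vec_1[:, None, :] - vec_2[None, :, :]
    # Summing float32 squares and taking the root keeps the float32 dtype.
    return np.sqrt((diff ** 2).sum(axis=-1))

points = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0]], dtype=np.float32)
dist = calc_distance_ref(points, points)
print(dist.dtype, dist.shape)
```

Comparing its output against calc_distance on a few hundred points (for example with np.allclose) is one way to confirm that the compiled function gives the same values in float32.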
Approach without repetitions
# Each unordered pair (i, j) with i < j is stored once, in row-major order.
@nb.njit(fastmath=True)
def calc_distance_pairs(vec):
    n = vec.shape[0]
    res = np.empty(n * (n - 1) // 2, dtype=vec.dtype)  # one entry per pair
    ii = 0
    for i in range(n):
        for j in range(i + 1, n):
            res[ii] = np.sqrt((vec[i, 0] - vec[j, 0])**2
                              + (vec[i, 1] - vec[j, 1])**2
                              + (vec[i, 2] - vec[j, 2])**2)
            ii += 1
    return res
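With float32 input, this cuts the amount of memory to slightly less than 1/4 of the scipy cdist approach: only n(n-1)/2 distances are stored instead of n^2, and each one takes 4 bytes instead of the 8 bytes of cdist's float64 output.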
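As a rough check of the memory figures for 40,000 points, and to show how a pair (i, j) maps to a position in the flat result, here is a small pure-Python sketch. The helper names (full_matrix_bytes, condensed_bytes, condensed_index) are made up for illustration; the index formula assumes the i < j, row-by-row ordering used by calc_distance_pairs.

```python
def full_matrix_bytes(n, itemsize):
    """Bytes needed for a dense n x n distance matrix."""
    return n * n * itemsize

def condensed_bytes(n, itemsize):
    """Bytes needed when each unordered pair is stored once."""
    return n * (n - 1) // 2 * itemsize

def condensed_index(i, j, n):
    """Position of the pair (i, j) in the flat array, for i != j."""
    if i == j:
        raise ValueError("the distance of a point to itself is not stored")
    if i > j:
        i, j = j, i  # the distance is symmetric, so order does not matter
    # Rows 0..i-1 hold (n-1) + (n-2) + ... entries; then offset inside row i.
    return n * i - i * (i + 1) // 2 + (j - i - 1)

n = 40_000
full_f64 = full_matrix_bytes(n, 8)   # dense float64 matrix
full_f32 = full_matrix_bytes(n, 4)   # dense float32 matrix
cond_f32 = condensed_bytes(n, 4)     # pairs only, float32

print(f"dense float64: {full_f64 / 1e9:.1f} GB")
print(f"dense float32: {full_f32 / 1e9:.1f} GB")
print(f"pairs float32: {cond_f32 / 1e9:.1f} GB")
```

So the distance between points i and j would be read as res[condensed_index(i, j, n)], at the cost of a little index arithmetic on every lookup.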
Timings
calc_distance: ~2s
calc_distance_pairs: ~3s
cdist: ~11s
Upvotes: 2