Memory-efficient Euclidean distance measurement

Question

I have 40,000 points and need to find the Euclidean distance between every pair of them. After searching online, I found that an efficient way to calculate Euclidean distances between pairs of points is scipy.spatial.distance.cdist. But since there are 40,000 points, the distance matrix will take around 12 GB of memory.

Is there a way to reduce the memory required to store the distance matrix without compromising the speed of the calculation? Can the data type be changed to float32 instead of float64 when calculating the distance matrix?

max9111 · Accepted Answer

cdist-like approach

The output dtype is the same as the input dtype, so passing float32 arrays gives a float32 distance matrix.

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_distance(vec_1, vec_2):
    # Full (n1, n2) distance matrix for 3D points; dtype follows vec_1.
    res = np.empty((vec_1.shape[0], vec_2.shape[0]), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):  # rows are computed in parallel
        for j in range(vec_2.shape[0]):
            res[i, j] = np.sqrt((vec_1[i, 0] - vec_2[j, 0])**2
                                + (vec_1[i, 1] - vec_2[j, 1])**2
                                + (vec_1[i, 2] - vec_2[j, 2])**2)

    return res
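To make the effect of the dtype concrete, here is a minimal sketch of the memory arithmetic for a full distance matrix. The helper name distance_matrix_nbytes and the random test points are my own illustration, not part of the original answer; the sketch only computes sizes and does not allocate the large matrix.

```python
import numpy as np

def distance_matrix_nbytes(n_points, dtype):
    """Bytes needed for a full n_points x n_points matrix of the given dtype."""
    # Hypothetical helper for illustration: pure arithmetic, nothing is allocated.
    return n_points * n_points * np.dtype(dtype).itemsize

# Illustrative 3D points; NumPy generates float64 by default.
rng = np.random.default_rng(0)
points64 = rng.random((1000, 3))
points32 = points64.astype(np.float32)  # cast once, before computing distances

n = 40_000
print(distance_matrix_nbytes(n, np.float64) / 1e9)  # 12.8 (GB)
print(distance_matrix_nbytes(n, np.float32) / 1e9)  # 6.4 (GB)
```

Passing points32 instead of points64 to a function that allocates its result with the input dtype therefore halves the size of the full matrix.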

Approach without repetitions

@nb.njit(fastmath=True)
def calc_distance_pairs(vec):
    # Condensed result: one entry per unordered pair (i < j), n*(n-1)/2 in total.
    n = vec.shape[0]
    res = np.empty(n * (n - 1) // 2, dtype=vec.dtype)

    ii = 0
    for i in range(n):
        for j in range(i + 1, n):
            res[ii] = np.sqrt((vec[i, 0] - vec[j, 0])**2
                              + (vec[i, 1] - vec[j, 1])**2
                              + (vec[i, 2] - vec[j, 2])**2)
            ii += 1

    return res

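With float32 input, this cuts the amount of memory to less than 1/4 of the scipy cdist approach: only the n(n-1)/2 unique pairs are stored, each at 4 bytes instead of 8. For 40,000 points that is roughly 3.2 GB instead of 12.8 GB.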
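Since the result is a flat array, looking up the distance for a particular pair needs an index calculation. The following is a small sketch, assuming the pairs are stored in the same order the loops in calc_distance_pairs produce them (all j > i for i = 0, then i = 1, and so on). The names pair_count and condensed_index are hypothetical helpers, not part of the original answer.

```python
def pair_count(n):
    """Number of unordered pairs among n points."""
    return n * (n - 1) // 2

def condensed_index(i, j, n):
    """Position of the pair (i, j) in the flat array of n*(n-1)/2 distances.

    Hypothetical helper: assumes row-major ordering over pairs with i < j.
    """
    if i == j:
        raise ValueError("the diagonal (i == j) is not stored")
    if i > j:
        i, j = j, i  # distance is symmetric, so order the pair
    # Entries contributed by rows 0..i-1, plus the offset within row i.
    return n * i - i * (i + 1) // 2 + (j - i - 1)

print(pair_count(40_000))        # 799980000
print(condensed_index(1, 3, 4))  # 4
```

With this, the distance between points i and j would be read as res[condensed_index(i, j, n)], where res is the array returned for n points.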

Timings

calc_distance: ~2s
calc_distance_pairs: ~3s
cdist: ~11s
