Efficiently updating distance between points

Question

I have a data set that has n rows (observations) and p columns (features):

import numpy as np
from scipy.spatial.distance import pdist, squareform
p = 3
n = 5
xOld = np.random.rand(n * p).reshape([n, p])

I am interested to get the distance between these points in a nxn matrix that really has n x (n-1)/2 unique values

sq_dists = pdist(xOld, 'sqeuclidean')
D_n = squareform(sq_dists)

Now imagine I get N additional observations and would like to update D_n. One very inefficient way is:

N = 3
xNew = np.random.rand(N * p).reshape([N, p])
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)

However, considering that n ~ 10000 and N ~ 100, this will be redundant. My goal is to get D_n_N more efficiently using D_n. In order to do that, I am dividing D_n_N as follows. I already have D_n and can calculate B [N x N]. However, I am wondering if there is a good way to calculate A (or A transpose) without bunch of for loops and finally construct D_n_N

D_n (n x n)    A [n x N]
A.T [N x n]    B [N x N]

Thanks in advance.

Divakar · Accepted Answer

Pretty interesting problem! Well I got to learn few new things here on the way to getting a solution on this.

Steps involved :

First off, we are introducing new pts here. So, we need to use cdist to get squared euclidean distances between the old and new pts. These would be accommodated in two blocks in the new output, one right below the old distances and one to the right of those old ones.
We also need to compute the pdist among the new pts and put its square-formed block along the trailing part of the new diagonal region.

Schematically put new D_n_N would look like this :

[   D_n      cdist.T
  cdist      New pdist squarefomed]

Summing up, the implementation would look something along these lines -

cdists = cdist( xNew, xOld, 'sqeuclidean')

n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))

Runtime test

Approaches -

# Original approach
def org_app(D_n, xNew):
    sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
    D_n_N = squareform(sq_dists)
    return D_n_N    

# Proposed approach
def proposed_app(D_n, xNew, N):
    cdists = cdist( xNew, xOld, 'sqeuclidean')    
    n1 = D_n.shape[0]
    out = np.empty((n1+N,n1+N))
    out[:n1,:n1] = D_n
    out[n1:,:n1] = cdists
    out[:n1,n1:] = cdists.T
    out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
    return out

Timings -

In [102]: # Setup inputs
     ...: p = 3
     ...: n = 5000
     ...: xOld = np.random.rand(n * p).reshape([n, p])
     ...: 
     ...: sq_dists = pdist(xOld, 'sqeuclidean')
     ...: D_n = squareform(sq_dists)
     ...: 
     ...: N = 3000
     ...: xNew = np.random.rand(N * p).reshape([N, p])
     ...: 

In [103]: np.allclose( proposed_app(D_n, xNew, N), org_app(D_n, xNew))
Out[103]: True

In [104]: %timeit org_app(D_n, xNew)
1 loops, best of 3: 541 ms per loop

In [105]: %timeit proposed_app(D_n, xNew, N)
1 loops, best of 3: 201 ms per loop

Efficiently updating distance between points

Answers (2)

Related Questions