Reputation: 1360
I have a data set that has n rows (observations) and p columns (features):
import numpy as np
from scipy.spatial.distance import pdist, squareform
p = 3
n = 5
xOld = np.random.rand(n * p).reshape([n, p])
I am interested to get the distance between these points in a nxn
matrix that really has n x (n-1)/2
unique values
sq_dists = pdist(xOld, 'sqeuclidean')
D_n = squareform(sq_dists)
Now imagine I get N
additional observations and would like to update D_n
. One very inefficient way is:
N = 3
xNew = np.random.rand(N * p).reshape([N, p])
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
However, considering that n ~ 10000 and N ~ 100, this will be redundant. My goal is to get D_n_N
more efficiently using D_n
. In order to do that, I am dividing D_n_N as follows. I already have D_n
and can calculate B [N x N]
. However, I am wondering if there is a good way to calculate A (or A transpose) without bunch of for loops and finally construct D_n_N
D_n (n x n) A [n x N]
A.T [N x n] B [N x N]
Thanks in advance.
Upvotes: 2
Views: 294
Reputation: 221704
Pretty interesting problem! Well I got to learn few new things here on the way to getting a solution on this.
Steps involved :
First off, we are introducing new pts here. So, we need to use cdist
to get squared euclidean distances between the old and new pts. These would be accommodated in two blocks in the new output, one right below the old distances and one to the right of those old ones.
We also need to compute the pdist
among the new pts and put its square-formed
block along the trailing part of the new diagonal region.
Schematically put new D_n_N
would look like this :
[ D_n cdist.T
cdist New pdist squarefomed]
Summing up, the implementation would look something along these lines -
cdists = cdist( xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
Runtime test
Approaches -
# Original approach
def org_app(D_n, xNew):
sq_dists = pdist(np.row_stack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
return D_n_N
# Proposed approach
def proposed_app(D_n, xNew, N):
cdists = cdist( xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1+N,n1+N))
out[:n1,:n1] = D_n
out[n1:,:n1] = cdists
out[:n1,n1:] = cdists.T
out[n1:,n1:] = squareform(pdist(xNew, 'sqeuclidean'))
return out
Timings -
In [102]: # Setup inputs
...: p = 3
...: n = 5000
...: xOld = np.random.rand(n * p).reshape([n, p])
...:
...: sq_dists = pdist(xOld, 'sqeuclidean')
...: D_n = squareform(sq_dists)
...:
...: N = 3000
...: xNew = np.random.rand(N * p).reshape([N, p])
...:
In [103]: np.allclose( proposed_app(D_n, xNew, N), org_app(D_n, xNew))
Out[103]: True
In [104]: %timeit org_app(D_n, xNew)
1 loops, best of 3: 541 ms per loop
In [105]: %timeit proposed_app(D_n, xNew, N)
1 loops, best of 3: 201 ms per loop
Upvotes: 2
Reputation: 18668
Just use cdist :
D_OO=cdist(xOld,xOld)
D_NN=cdist(xNew,xNew)
D_NO=cdist(xNew,xOld)
D_ON=cdist(xOld,xNew) # or D_NO.T
And finally :
D_=vstack((hstack((D_OO,D_ON)),(hstack((D_NO,D_NN)))))
Upvotes: 1