Reputation: 1423
I am looking for NumPy way of calculating Mahalanobis distance between two numpy arrays (x and y). The following code can correctly calculate the same using cdist function of Scipy. Since this function calculates unnecessary matix in my case, I want more straight way of calculating it using NumPy only.
import numpy as np
from scipy.spatial.distance import cdist
x = np.array([[[1,2,3,4,5],
[5,6,7,8,5],
[5,6,7,8,5]],
[[11,22,23,24,5],
[25,26,27,28,5],
[5,6,7,8,5]]])
i,j,k = x.shape
xx = x.reshape(i,j*k).T
y = np.array([[[31,32,33,34,5],
[35,36,37,38,5],
[5,6,7,8,5]],
[[41,42,43,44,5],
[45,46,47,48,5],
[5,6,7,8,5]]])
yy = y.reshape(i,j*k).T
results = cdist(xx,yy,'mahalanobis')
results = np.diag(results)
print results
[ 2.28765854 2.75165028 2.75165028 2.75165028 0. 2.75165028
2.75165028 2.75165028 2.75165028 0. 0. 0. 0.
0. 0. ]
My trial:
VI = np.linalg.inv(np.cov(xx,yy))
print np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T))
Could anybody correct this method?
Here is formula for it:
Upvotes: 20
Views: 35866
Reputation: 91
Another simple solution which is just as fast as the einsum
e = xx-yy
X = np.vstack([xx,yy])
V = np.cov(X.T)
p = np.linalg.inv(V)
D = np.sqrt(np.sum(np.dot(e,p) * e, axis = 1))
Upvotes: 8
Reputation: 25508
I think your problem lies in the construction of your covariance matrix. Try:
X = np.vstack([xx,yy])
V = np.cov(X.T)
VI = np.linalg.inv(V)
print np.diag(np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T)))
Output:
[ 2.28765854 2.75165028 2.75165028 2.75165028 0. 2.75165028
2.75165028 2.75165028 2.75165028 0. 0. 0. 0.
0. 0. ]
To do this without the intermediate array implicitly created here, you might have to sacrifice a C loop for a Python one:
A = np.dot((xx-yy),VI)
B = (xx-yy).T
n = A.shape[0]
D = np.empty(n)
for i in range(n):
D[i] = np.sqrt(np.sum(A[i] * B[:,i]))
EDIT: actually, with np.einsum
voodoo you can remove the Python loop and speed it up a lot (on my system, from 84.3 µs to 2.9 µs):
D = np.sqrt(np.einsum('ij,ji->i', A, B))
EDIT: As @Warren Weckesser points out, einsum
can be used to do away with the intermediate A
and B
arrays too:
delta = xx - yy
D = np.sqrt(np.einsum('nj,jk,nk->n', delta, VI, delta))
Upvotes: 27