Borys
Borys

Reputation: 1423

Calculate Mahalanobis distance using NumPy only

I am looking for NumPy way of calculating Mahalanobis distance between two numpy arrays (x and y). The following code can correctly calculate the same using cdist function of Scipy. Since this function calculates unnecessary matix in my case, I want more straight way of calculating it using NumPy only.

import numpy as np
from scipy.spatial.distance import cdist

x = np.array([[[1,2,3,4,5],
               [5,6,7,8,5],
               [5,6,7,8,5]],
              [[11,22,23,24,5],
               [25,26,27,28,5],
               [5,6,7,8,5]]])
i,j,k = x.shape

xx = x.reshape(i,j*k).T


y = np.array([[[31,32,33,34,5],
               [35,36,37,38,5],
               [5,6,7,8,5]],
              [[41,42,43,44,5],
               [45,46,47,48,5],
               [5,6,7,8,5]]])


yy = y.reshape(i,j*k).T

results =  cdist(xx,yy,'mahalanobis')
results = np.diag(results)
print results



[ 2.28765854  2.75165028  2.75165028  2.75165028  0.          2.75165028
  2.75165028  2.75165028  2.75165028  0.          0.          0.          0.
  0.          0.        ]

My trial:

VI = np.linalg.inv(np.cov(xx,yy))

print np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T))

Could anybody correct this method?

Here is formula for it:

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.mahalanobis.html#scipy.spatial.distance.mahalanobis

Upvotes: 20

Views: 35866

Answers (2)

David
David

Reputation: 91

Another simple solution which is just as fast as the einsum

e = xx-yy
X = np.vstack([xx,yy])
V = np.cov(X.T) 
p = np.linalg.inv(V)
D = np.sqrt(np.sum(np.dot(e,p) * e, axis = 1))

Upvotes: 8

xnx
xnx

Reputation: 25508

I think your problem lies in the construction of your covariance matrix. Try:

X = np.vstack([xx,yy])
V = np.cov(X.T)
VI = np.linalg.inv(V)
print np.diag(np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T)))

Output:

[ 2.28765854  2.75165028  2.75165028  2.75165028  0.          2.75165028
  2.75165028  2.75165028  2.75165028  0.          0.          0.          0.
  0.          0.        ]

To do this without the intermediate array implicitly created here, you might have to sacrifice a C loop for a Python one:

A = np.dot((xx-yy),VI)
B = (xx-yy).T
n = A.shape[0]
D = np.empty(n)
for i in range(n):
    D[i] = np.sqrt(np.sum(A[i] * B[:,i]))

EDIT: actually, with np.einsum voodoo you can remove the Python loop and speed it up a lot (on my system, from 84.3 µs to 2.9 µs):

D = np.sqrt(np.einsum('ij,ji->i', A, B))

EDIT: As @Warren Weckesser points out, einsum can be used to do away with the intermediate A and B arrays too:

delta = xx - yy
D = np.sqrt(np.einsum('nj,jk,nk->n', delta, VI, delta))

Upvotes: 27

Related Questions