Reputation: 377
I'd like to use the Mahalanobis distance in the K-means algorithm, because I have 4 variables which are highly correlated (0.85).
It appears to me that it's better to use the Mahalanobis distance in this case.
The problem is I don't know how to implement it in R with the K-means algorithm.
I think I need to "fake" it by transforming the data before the clustering step, but I don't know how.
I tried classical k-means with the Euclidean distance on standardized data, but as I said, there is too much correlation.
fit <- kmeans(mydata.standardize, 4)  # k-means with Euclidean distance on the standardized data
I also tried to find a distance parameter, but kmeans() does not seem to have one.
The expected result is a way to apply the K-means algorithm with the Mahalanobis distance.
Upvotes: 3
Views: 9450
Reputation: 61
You can see on page 10 of Brian S. Everitt's book "An R and S-PLUS® Companion to Multivariate Analysis" the formula for the Mahalanobis distance. The Euclidean distance is a special case of the Mahalanobis distance, obtained when the sample covariance is the identity matrix. So the Euclidean distance on the rescaled data y is the Mahalanobis distance on the original data.
# Rescale the data (x is the original data matrix)
C <- chol( var(x) )   # Cholesky factor of the sample covariance
y <- x %*% solve(C)
var(y)                # the identity matrix
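A minimal sanity check (a sketch, assuming x above holds the original data): the Euclidean distance between two rows of y should equal the Mahalanobis distance between the same two rows of x, computed with the sample covariance of x.
# Compare the two distances for the first two observations
S <- var(x)
d.mahalanobis <- sqrt( mahalanobis(x[1, , drop = FALSE], center = x[2, ], cov = S) )
d.euclidean   <- sqrt( sum( (y[1, ] - y[2, ])^2 ) )
all.equal(as.numeric(d.mahalanobis), d.euclidean)  # TRUE, up to numerical error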
Upvotes: 0
Reputation: 32351
You can rescale the data before running the algorithm, using the Cholesky decomposition of the variance matrix: the Euclidean distance after the transformation equals the Mahalanobis distance before it.
# Sample data
n <- 100
k <- 5
x <- matrix( rnorm(k*n), nrow=n, ncol=k )
x[,1:2] <- x[,1:2] %*% matrix( c(.9,1,1,.9), 2, 2 )  # introduce correlation between the first two variables
var(x)  # the variables are now correlated
# Rescale the data
C <- chol( var(x) )
y <- x %*% solve(C)
var(y) # The identity matrix
kmeans(y, 4)
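A small follow-up sketch, using y and C from above: the cluster assignments from kmeans(y, 4) apply directly to the rows of x, and the centers can be mapped back to the original scale, since y = x %*% solve(C) implies x = y %*% C.
fit <- kmeans(y, 4)
fit$cluster        # cluster assignments, one per row of x
fit$centers %*% C  # centers expressed in the original variables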
But this assumes that all the clusters have the same shape and orientation as the whole data.
If this is not the case, you may want to look at models that explicitly allow for elliptical clusters, e.g., in the mclust package.
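For reference, a minimal sketch of that alternative (assuming the mclust package is installed): Mclust fits a Gaussian mixture, so each cluster can have its own covariance matrix.
library(mclust)
fit <- Mclust(x, G = 4)  # Gaussian mixture with 4 components
fit$classification       # cluster assignments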
Upvotes: 14