Reputation: 377
I'd like to use the Mahalanobis distance in the K-means algorithm, because I have 4 variables which are highly correlated (0.85).
It appears to me that it's better to use the Mahalanobis distance in this case.
The problem is I don't know how to implement it in R with the K-means algorithm.
I think I need to "fake" it by transforming the data before the clustering step, but I don't know how.
I tried classical k-means with the Euclidean distance on standardized data, but as I said, there is too much correlation.
fit <- kmeans(mydata.standardize, 4)  # k-means with Euclidean distance on the standardized data
I also tried to find a distance parameter, but kmeans() does not seem to have one.
The expected result is a way to apply the K-means algorithm with the Mahalanobis distance.
Upvotes: 3
Views: 9450
Reputation: 61
You can see on page 10 of Brian S. Everitt's book "An R and S-PLUS® Companion to Multivariate Analysis" the formula for the Mahalanobis distance. The Euclidean distance is a special case of the Mahalanobis distance, obtained when the sample covariance is the identity matrix. So the Euclidean distance on the rescaled data y is the Mahalanobis distance on the original data.
# Rescale the data (x is the original data matrix)
C <- chol( var(x) )   # Cholesky factor of the sample covariance
y <- x %*% solve(C)
var(y)                # the identity matrix
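A minimal sanity check (a sketch, assuming x above holds the original data): the Euclidean distance between two rows of y should equal the Mahalanobis distance between the same two rows of x, computed with the sample covariance of x.
# Compare the two distances for the first two observations
S <- var(x)
d.mahalanobis <- sqrt( mahalanobis(x[1, , drop = FALSE], center = x[2, ], cov = S) )
d.euclidean   <- sqrt( sum( (y[1, ] - y[2, ])^2 ) )
all.equal(as.numeric(d.mahalanobis), d.euclidean)  # TRUE, up to numerical error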
Upvotes: 0
Reputation: 32351
You can rescale the data before running the algorithm, using the Cholesky decomposition of the variance matrix: the Euclidean distance after the transformation equals the Mahalanobis distance before it.
# Sample data
n <- 100
k <- 5
x <- matrix( rnorm(k*n), nrow=n, ncol=k )
x[,1:2] <- x[,1:2] %*% matrix( c(.9,1,1,.9), 2, 2 )  # introduce correlation between the first two variables
var(x)  # the variables are now correlated
# Rescale the data
C <- chol( var(x) )
y <- x %*% solve(C)
var(y) # The identity matrix
kmeans(y, 4)
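A small follow-up sketch, using y and C from above: the cluster assignments from kmeans(y, 4) apply directly to the rows of x, and the centers can be mapped back to the original scale, since y = x %*% solve(C) implies x = y %*% C.
fit <- kmeans(y, 4)
fit$cluster        # cluster assignments, one per row of x
fit$centers %*% C  # centers expressed in the original variables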
But this assumes that all the clusters have the same shape and orientation as the whole data.
If this is not the case, you may want to look at models that explicitly allow for elliptical clusters, e.g., in the mclust package.
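For reference, a minimal sketch of that alternative (assuming the mclust package is installed): Mclust fits a Gaussian mixture, so each cluster can have its own covariance matrix.
library(mclust)
fit <- Mclust(x, G = 4)  # Gaussian mixture with 4 components
fit$classification       # cluster assignments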
Upvotes: 14