Reputation: 914
I would like to find an algorithm that circumvents some drawbacks of k-means.
Given:
x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
matrix <- cbind(x, y)               # defining the data matrix
Kmeans <- kmeans(matrix, centers=2) # with 2 centroids
plot(x, y, col=Kmeans$cluster, pch=19, cex=2)
points(Kmeans$centers, col=1:2, pch=3, cex=3, lwd=3)
Here I would like an algorithm that clusters the data into two groups divided by a diagonal running from the lower-left corner to the upper-right corner.
Upvotes: 1
Views: 324
Reputation: 9656
What you are asking for can be solved in multiple ways. Here are two:
If you want your line to pass through the origin, then simply check whether x > y:
x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
thePoints <- cbind(x,y)
as.integer(thePoints[,1] > thePoints[,2])
[1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
This puts points below the diagonal y = x (which passes through the origin) in one group, and the remaining points in the other. Keep in mind that if your line does not pass through the origin, you have to modify this check slightly: compare y against a*x + b instead of just against x.
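As a sketch of that modification, here is one way to classify the question's points relative to an arbitrary line y = a*x + b (the values a = 1 and b = 0.5 are illustrative assumptions, chosen so the split matches the diagonal the question asks for):

```r
# Classify points relative to the line y = a*x + b
# (a and b are illustrative; tweak them to move the dividing line)
x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
a <- 1
b <- 0.5
cluster <- as.integer(y < a*x + b) + 1  # 2 = below the line, 1 = on/above it
plot(x, y, col=cluster, pch=19, cex=2)
abline(b, a, lty=2)                     # draw the dividing line
```

With these values the first ten points fall below the line and the last ten above it, reproducing the x > y split while showing where the intercept enters.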
The K-means function:
myKmeans <- function(x, centers, distFun, nIter=10) {
  clusterHistory <- vector(nIter, mode="list")
  centerHistory <- vector(nIter, mode="list")
  for(i in 1:nIter) {
    distsToCenters <- distFun(x, centers)           # n x k matrix of distances
    clusters <- apply(distsToCenters, 1, which.min) # assign each point to its nearest center
    centers <- apply(x, 2, tapply, clusters, mean)  # recompute centers as cluster means
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }
  list(clusters=clusterHistory, centers=centerHistory)
}
And correlation distance:
myCor <- function(points1, points2) {
  # correlation distance between rows, rescaled to [0, 1]
  return(1 - ((cor(t(points1), t(points2)) + 1) / 2))
}
centers <- thePoints[c(1, 11), ] # two initial centers picked from the data
theResult <- myKmeans(thePoints, centers, myCor, 10)
Here is how both solutions would look:
plot(thePoints, col=as.integer(thePoints[,1] > thePoints[,2]) + 1, main="Using a line", xlab="x", ylab="y")
plot(thePoints, col=theResult$clusters[[10]], main="K-means with correlation clustering", xlab="x", ylab="y")
points(theResult$centers[[10]], col=1:2, cex=3, pch=19)
So it's more about which distance measure you use than about some inherent deficiency of k-means.
You can also find better R implementations of k-means with correlation distance instead of the one I provided here.
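To make that point concrete, here is a sketch of a Euclidean distance function in the same rows-vs-rows shape the custom myKmeans above expects; plugging it in instead of the correlation distance would recover ordinary k-means behavior (the name euclidDist is illustrative, not from the original answer):

```r
# Euclidean counterpart to the correlation distance: returns an
# nrow(points1) x nrow(points2) matrix of pairwise distances
euclidDist <- function(points1, points2) {
  t(apply(points1, 1, function(p)
    sqrt(colSums((t(points2) - p)^2))))
}

# Usage (assumes myKmeans and thePoints from above):
# theResult <- myKmeans(thePoints, thePoints[c(1, 11), ], euclidDist, 10)
```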
Upvotes: 0
Reputation: 3218
Try Mclust from the mclust package; it will fit a Gaussian mixture to your data.
The default behavior:
library(mclust)               # assumes the mclust package is installed
mc <- Mclust(matrix)          # 'matrix' is the data matrix from the question
points(t(mc$parameters$mean)) # add the fitted means to the existing plot
plot(mc)
By default this will find 4 groups, but you can force it to use 2, or constrain the covariance structure so the Gaussians stretch in the right direction.
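As a sketch of forcing two groups, mclust's G argument fixes the number of mixture components (the data vectors are copied from the question):

```r
# Force a 2-component Gaussian mixture via Mclust's G argument
library(mclust)
x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
mc2 <- Mclust(cbind(x, y), G = 2)    # G = 2 fixes the number of components
plot(mc2, what = "classification")   # show the two fitted clusters
```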
Be aware that it will be hard to interpret and justify the meaning of your groups unless you understand very well why you want exactly two of them.
Upvotes: 1