Googme

Reputation: 914

How to find the right cluster algorithm?

I would like to find an algorithm that circumvents some drawbacks of k-means:

Given:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

matrix <- cbind(x, y)                   # bind the coordinates into a matrix
Kmeans <- kmeans(matrix, centers = 2)   # k-means with 2 centroids

plot(x, y, col = Kmeans$cluster, pch = 19, cex = 2)
points(Kmeans$centers, col = 1:2, pch = 3, cex = 3, lwd = 3)

Here I would like an algorithm that clusters the data into two groups separated by the diagonal running from the bottom-left corner to the top-right corner.

Upvotes: 1

Views: 324

Answers (2)

Karolis Koncevičius

Reputation: 9656

What you are asking for can be solved in multiple ways. Here are two:

  1. The first way is to simply define the separating line of your clusters. Since you know how your points should be grouped (by a line), you can use that directly.

If you want your line to start at the origin, then simply check if x > y:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

thePoints <- cbind(x, y)

as.integer(thePoints[,1] > thePoints[,2])
#  [1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

This puts the points above the diagonal (through the origin) into one group and the rest into the other. Keep in mind that if your line does not go through the origin, you have to modify this check a bit.
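For a line with an intercept, one way to modify the check is to compare `y` against `a*x + b`. A minimal sketch, where the slope `a` and intercept `b` are made-up values for illustration, not from the answer:

```r
# Hedged sketch: classifying points against a hypothetical line y = a*x + b.
x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

a <- 1    # slope (assumed for this example)
b <- -1   # intercept (assumed for this example)

groups <- as.integer(y < a * x + b)  # 1 = below the line, 0 = on or above it
groups
```

Any line in slope-intercept form works here; only the comparison changes, the rest of the approach stays the same.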

  2. K-means with a correlation distance:

The K-means function:

myKmeans <- function(x, centers, distFun, nIter=10) {
    clusterHistory <- vector(nIter, mode="list")
    centerHistory <- vector(nIter, mode="list")

    for(i in 1:nIter) {
        distsToCenters <- distFun(x, centers)           # distances of every point to every center
        clusters <- apply(distsToCenters, 1, which.min) # assign each point to its nearest center
        centers <- apply(x, 2, tapply, clusters, mean)  # recompute centers as cluster means
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}
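The function expects `distFun(x, centers)` to return a matrix with one row per point and one column per center. As a sanity check of that contract, here is a minimal Euclidean distance function (the name `euclidDist` is my own, not from the answer):

```r
# Minimal Euclidean distFun sketch matching the signature myKmeans expects:
# rows of the result are points, columns are centers.
euclidDist <- function(points, centers) {
  apply(centers, 1, function(ctr) sqrt(rowSums(sweep(points, 2, ctr)^2)))
}

pts  <- cbind(c(4, 4, 5, 5), c(2, 3, 3, 4))
ctrs <- pts[c(1, 4), ]   # use two of the points themselves as starting centers
d <- euclidDist(pts, ctrs)
dim(d)                   # 4 points x 2 centers
```

Passing `euclidDist` to `myKmeans` would reproduce ordinary k-means; swapping in the correlation distance below is what changes the cluster shapes.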

And correlation distance:

myCor <- function(points1, points2) {
    # map correlation in [-1, 1] to a distance in [0, 1]
    return(1 - ((cor(t(points1), t(points2))+1)/2))
}

# Pick two starting centers, e.g. two of the points themselves:
centers <- thePoints[c(1, 10), ]
theResult <- myKmeans(thePoints, centers, myCor, 10)


Here is how both solutions look:

plot(thePoints, col=as.integer(thePoints[,1] > thePoints[,2])+1, main="Using a line", xlab="x", ylab="y")
plot(thePoints, col=theResult$clusters[[10]], main="K-means with correlation clustering", xlab="x", ylab="y")
points(theResult$centers[[10]], col=1:2, cex=3, pch=19)

(plot: line split vs. k-means with correlation distance)

So it is more about which distance measure you use than about some deficiency of k-means.

You can also find better implementations of k-means with correlation distance for R than the one I sketched here.

Upvotes: 0

iago-lito

Reputation: 3218

Try Mclust from the mclust package; it fits a Gaussian mixture to your data. The default behavior:

library(mclust)

mc <- Mclust(matrix)
points(t(mc$parameters$mean))
plot(mc)

... will find 4 groups, but you might be able to force it to 2, or to constrain the covariance structure so the Gaussians are stretched in the right direction.
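Forcing the number of groups can be done with the `G` argument of `Mclust`. A short sketch on the question's data (the variable names `dat` and `mc2` are mine):

```r
# Hedged sketch: fixing Mclust to exactly two mixture components via G.
library(mclust)

x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
dat <- cbind(x, y)

mc2 <- Mclust(dat, G = 2)   # G fixes the number of components to 2
table(mc2$classification)   # resulting cluster sizes
```

Whether the two fitted Gaussians align with the desired diagonal split depends on which covariance model `Mclust` selects, so inspect the result with `plot(mc2)`.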

Be aware that it'll be hard to interpret and justify the meaning of your groups unless you understand very well the reason why you want them to be 2 etc..

Upvotes: 1
