Googme

Reputation: 914

How to find the right cluster algorithm?

I would like to find an algorithm that circumvents some drawbacks of k-means:

Given:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

matrix <- cbind(x, y)                   # bind the coordinates into a matrix
Kmeans <- kmeans(matrix, centers = 2)   # k-means with 2 centroids

plot(x, y, col = Kmeans$cluster, pch = 19, cex = 2)
points(Kmeans$centers, col = 1:2, pch = 3, cex = 3, lwd = 3)

Here I would like an algorithm that clusters the data into two groups separated by the diagonal running from the bottom-left corner to the top-right corner.

Upvotes: 1

Views: 324

Answers (2)

Karolis Koncevičius

Reputation: 9656

What you are asking for can be solved in multiple ways. Here are two:

  1. The first way is to simply define the separating line of your clusters. Since you know how your points should be grouped (by a line), you can use that directly.

If you want your line to start at the origin, then simply check if x > y:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

thePoints <- cbind(x, y)

as.integer(thePoints[,1] > thePoints[,2])
#  [1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

This puts the points above the diagonal (through the origin) into one group and the rest into the other. Keep in mind that if your line does not go through the origin, you have to modify this check a bit.
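For a line with an intercept, one way to modify the check is to compare `y` against `a*x + b`. A minimal sketch, where the slope `a` and intercept `b` are made-up values for illustration, not from the answer:

```r
# Hedged sketch: classifying points against a hypothetical line y = a*x + b.
x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

a <- 1    # slope (assumed for this example)
b <- -1   # intercept (assumed for this example)

groups <- as.integer(y < a * x + b)  # 1 = below the line, 0 = on or above it
groups
```

Any line in slope-intercept form works here; only the comparison changes, the rest of the approach stays the same.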

  2. K-means with a correlation distance:

The K-means function:

myKmeans <- function(x, centers, distFun, nIter=10) {
    clusterHistory <- vector(nIter, mode="list")
    centerHistory <- vector(nIter, mode="list")

    for(i in 1:nIter) {
        distsToCenters <- distFun(x, centers)           # distances of every point to every center
        clusters <- apply(distsToCenters, 1, which.min) # assign each point to its nearest center
        centers <- apply(x, 2, tapply, clusters, mean)  # recompute centers as cluster means
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}
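The function expects `distFun(x, centers)` to return a matrix with one row per point and one column per center. As a sanity check of that contract, here is a minimal Euclidean distance function (the name `euclidDist` is my own, not from the answer):

```r
# Minimal Euclidean distFun sketch matching the signature myKmeans expects:
# rows of the result are points, columns are centers.
euclidDist <- function(points, centers) {
  apply(centers, 1, function(ctr) sqrt(rowSums(sweep(points, 2, ctr)^2)))
}

pts  <- cbind(c(4, 4, 5, 5), c(2, 3, 3, 4))
ctrs <- pts[c(1, 4), ]   # use two of the points themselves as starting centers
d <- euclidDist(pts, ctrs)
dim(d)                   # 4 points x 2 centers
```

Passing `euclidDist` to `myKmeans` would reproduce ordinary k-means; swapping in the correlation distance below is what changes the cluster shapes.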

And correlation distance:

myCor <- function(points1, points2) {
    # map correlation in [-1, 1] to a distance in [0, 1]
    return(1 - ((cor(t(points1), t(points2))+1)/2))
}

# Pick two starting centers, e.g. two of the points themselves:
centers <- thePoints[c(1, 10), ]
theResult <- myKmeans(thePoints, centers, myCor, 10)


Here is how both solutions look:

plot(thePoints, col=as.integer(thePoints[,1] > thePoints[,2])+1, main="Using a line", xlab="x", ylab="y")
plot(thePoints, col=theResult$clusters[[10]], main="K-means with correlation clustering", xlab="x", ylab="y")
points(theResult$centers[[10]], col=1:2, cex=3, pch=19)

(plot: line split vs. k-means with correlation distance)

So it is more about which distance measure you use than about some deficiency of k-means.

You can also find better implementations of k-means with correlation distance for R than the one I sketched here.

Upvotes: 0

iago-lito

Reputation: 3218

Try Mclust from the mclust package; it fits a Gaussian mixture to your data. The default behavior:

library(mclust)

mc <- Mclust(matrix)
points(t(mc$parameters$mean))
plot(mc)

... will find 4 groups, but you might be able to force it to 2, or to constrain the covariance structure so the Gaussians are stretched in the right direction.
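Forcing the number of groups can be done with the `G` argument of `Mclust`. A short sketch on the question's data (the variable names `dat` and `mc2` are mine):

```r
# Hedged sketch: fixing Mclust to exactly two mixture components via G.
library(mclust)

x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
dat <- cbind(x, y)

mc2 <- Mclust(dat, G = 2)   # G fixes the number of components to 2
table(mc2$classification)   # resulting cluster sizes
```

Whether the two fitted Gaussians align with the desired diagonal split depends on which covariance model `Mclust` selects, so inspect the result with `plot(mc2)`.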

Be aware that it'll be hard to interpret and justify the meaning of your groups unless you understand very well the reason why you want them to be 2 etc..

Upvotes: 1
