user1464628


K-means algorithm variation with minimum measure of size

I'm looking for an algorithm similar to k-means that groups points on a map into a fixed number of groups by distance. The number of groups has already been decided, but the tricky part (at least for me) is meeting the criterion that the sum of MOS within each group must fall in a certain range, say greater than 1. Is there any way to make that happen?

ID  MOS   X         Y
1   0.47  39.27846  -76.77101
2   0.43  39.22704  -76.70272
3   1.48  39.24719  -76.68485
4   0.15  39.25172  -76.69729
5   0.09  39.24341  -76.69884

Upvotes: 4

Views: 1113

Answers (1)

Marc in the box

Reputation: 11995

I was intrigued by your question, but I was unsure how you might introduce some sort of random process into a grouping algorithm. It seems that the kmeans algorithm does indeed give different results if you permute your dataset (e.g. the order of the rows). I found this bit of info here. The following script demonstrates this with a random set of data. The plot shows the raw data in black and then draws a segment from each point to the center of its cluster, with one color per permutation.

Since I'm not sure how your MOS variable is defined, I have added a random variable to the data frame to illustrate how you might look for clusterings that satisfy a given criterion. The sum of MOS is calculated for each cluster and the result is stored in the MOS.sums object. To reproduce a favorable clustering, you can reuse the random seed that was used for its permutation, which is stored in the seeds object. You can see that the permutations result in several different clusterings:

set.seed(33)
nsamples <- 500
nperms <- 10
nclusters <- 3

# Random example data; MOS is a stand-in for the real variable
df <- data.frame(x=runif(nsamples), y=runif(nsamples), MOS=runif(nsamples))

# One row per permutation, one column per cluster
MOS.sums <- matrix(NaN, nrow=nperms, ncol=nclusters)
colnames(MOS.sums) <- paste("cluster", 1:nclusters, sep=".")
rownames(MOS.sums) <- paste("perm", 1:nperms, sep=".")

# One seed per permutation, so that any clustering can be reproduced later
seeds <- round(runif(nperms, min=1, max=10000))

plot(df$x, df$y)
COL <- rainbow(nperms)
for(i in seq(nperms)){
    set.seed(seeds[i])
    ORD <- sample(nsamples)  # permute the row order
    K <- kmeans(df[ORD,1:2], centers=nclusters)
    MOS.sums[i,] <- tapply(df$MOS[ORD], K$cluster, sum)  # sum of MOS per cluster
    # Link each point to its cluster center, one color per permutation
    segments(df$x[ORD], df$y[ORD], K$centers[K$cluster,1], K$centers[K$cluster,2], col=COL[i])
}
seeds
MOS.sums
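To go one step further, here is a rough sketch of how you might keep only the permutations whose clusters all meet the MOS criterion, and then reproduce one of them from its seed. The threshold `min_mos` and the helper `cluster_once` are names I made up for illustration, and the value 80 is arbitrary for this random data. This is just a trial-and-error search, so there is no guarantee that any permutation passes; in that case `good_seeds` is simply empty.

```r
# Hypothetical threshold: every cluster's summed MOS must exceed this value
min_mos <- 80

set.seed(33)
nsamples <- 500
nperms <- 10
nclusters <- 3
df <- data.frame(x = runif(nsamples), y = runif(nsamples), MOS = runif(nsamples))
seeds <- round(runif(nperms, min = 1, max = 10000))

# Hypothetical helper: cluster one seeded permutation of the rows and
# return the permutation, the kmeans fit, and the per-cluster MOS sums
cluster_once <- function(seed) {
  set.seed(seed)
  ord <- sample(nrow(df))
  km <- kmeans(df[ord, c("x", "y")], centers = nclusters)
  list(order = ord, fit = km, sums = tapply(df$MOS[ord], km$cluster, sum))
}

runs <- lapply(seeds, cluster_once)

# TRUE for permutations where every cluster satisfies the criterion
ok <- vapply(runs, function(r) all(r$sums > min_mos), logical(1))
good_seeds <- seeds[ok]

# Reproduce the first acceptable clustering from its seed, if there is one
if (length(good_seeds) > 0) {
  best <- cluster_once(good_seeds[1])
  # Map cluster labels back to the original row order
  df$cluster <- NA_integer_
  df$cluster[best$order] <- best$fit$cluster
  print(tapply(df$MOS, df$cluster, sum))
}
```

If nothing passes, you could try more permutations by increasing `nperms`, but for a strict criterion you may need an approach that enforces the constraint directly rather than relying on chance.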


Upvotes: 3
