Reputation: 149
I'm clustering a set of words using "Hierarchical Clustering". I want each cluster to contain a certain number of words, for example 2 words, or 3 words.
I'm trying to modify existing code for this clustering.
I simply set the values for max(d) to Inf as well:
Lm[min(d), ] <- sl
Lm[, min(d)] <- sl
if (length(cluster) > 2) {
  # If this is already a cluster with more than 2 points,
  # don't cluster it again: set its distances to Inf
  Lm[min(d), min(d)] <- Inf
  Lm[max(d), max(d)] <- Inf
  Lm[max(d), ] <- Inf
  Lm[, max(d)] <- Inf
  Lm[min(d), ] <- Inf
  Lm[, min(d)] <- Inf
}
However, this doesn't give me the expected results. Is this the correct approach? How can I do this type of constrained clustering in R?
Example of the results I got:
row V1 V2
166 -194 -38
167 166 -1
……..
240 239 239
241 240 240
242 241 241
243 242 242
244 243 243
Upvotes: 0
Views: 1744
Reputation: 5059
Just to give you an example of what I meant with partitional clustering:
library(cluster)
data("ruspini")
desired_cluster_size <- 3L
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 3 2 4 2 2 2 1 3 3 2 3 2 3 3 2 6 3 2 1 3 6 2 8 4
This can't guarantee how many observations you'll have in each group, and it's not deterministic, but it at least gives you an approximation. In the tabulated results (clusters 1 through 25) you can see that 18 of the 25 clusters ended up with 2 or 3 elements, while the remaining sizes range from 1 to 8.
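Since the result changes from run to run, it can help to fix the random seed and summarise the size distribution rather than reading it off the raw table. The following is only a sketch: the seed value and `nstart = 10` are arbitrary choices of mine, not something the approach requires, and the cluster sizes you get will differ from the table above.

```r
library(cluster)  # provides the ruspini data set
data("ruspini")

set.seed(42)  # arbitrary seed, only so that repeated runs give the same result

desired_cluster_size <- 3L
k <- round(nrow(ruspini) / desired_cluster_size)

# nstart = 10 (an arbitrary choice) tries several random initialisations
# and keeps the best one
km <- kmeans(ruspini, centers = k, nstart = 10)

# Cluster sizes as a plain integer vector
sizes <- as.integer(table(km$cluster))

table(sizes)              # how many clusters there are of each size
mean(sizes %in% c(2, 3))  # fraction of clusters with 2 or 3 elements
```

The sizes are still not guaranteed; this only makes the approximation repeatable and easier to judge.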
Upvotes: 0
Reputation: 77454
This will be tough to optimize, or it can produce arbitrarily bad results, because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100, and assume you want to limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. They have now reached the maximum size, so the only remaining option is to cluster -100 and +100, the worst possible result: this cluster spans the entire data set.
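The example above can be reproduced with a small greedy procedure. This is a minimal sketch, not the asker's code and not a function from any package: `capped_hclust` is a hypothetical helper that does single-linkage agglomeration on one-dimensional data and refuses any merge that would exceed `max_size`.

```r
# Hypothetical helper: greedy single-linkage merging with a hard size cap.
capped_hclust <- function(x, max_size = 2L) {
  clusters <- as.list(seq_along(x))  # start with singletons (indices into x)
  repeat {
    n <- length(clusters)
    if (n < 2L) break
    best <- NULL
    best_d <- Inf
    for (i in seq_len(n - 1L)) {
      for (j in (i + 1L):n) {
        # Skip merges that would violate the size constraint
        if (length(clusters[[i]]) + length(clusters[[j]]) > max_size) next
        # Single linkage: smallest pairwise distance between the two clusters
        d <- min(abs(outer(x[clusters[[i]]], x[clusters[[j]]], "-")))
        if (d < best_d) {
          best_d <- d
          best <- c(i, j)
        }
      }
    }
    if (is.null(best)) break  # no admissible merge is left
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  lapply(clusters, function(idx) x[idx])
}

x <- c(-100, -1, 1, 100)
result <- capped_hclust(x, max_size = 2L)
print(result)
```

With this input the first merge joins -1 and 1, and the only merge still allowed afterwards is -100 with 100, so the result is the two clusters {-100, 100} and {-1, 1}.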
Upvotes: 1