Ruddy Cahyanto

Reputation: 113

Is it possible that there are clusters that do not have members in k-means clustering?

I have a text clustering project using the k-means algorithm. My dataset consists of political-sentiment Facebook comments, so each comment has a positive, negative, or neutral label.

What I have done in my application program is as follows:

1. Normalize the text and remove stopwords.
2. Weight terms using tf-idf.
3. Construct a vector for each comment.
4. Set the number of clusters to 3 (based on the number of labels).
5. Choose the initial centroids randomly, one per label, so each label contributes one comment as a centroid.
6. Calculate the distance between each comment's vector and each centroid, and assign the comment to the closest centroid.
7. Calculate the average vector of each cluster as its new centroid.
8. Repeat steps 6 and 7 until the centroids no longer change.
9. Output the final clusters.
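The iterative part (steps 6-8) can be sketched in plain Python. This is a minimal illustration of the loop described above, not the actual program; the `kmeans` function name and the use of Euclidean distance are assumptions:

```python
import math

def kmeans(vectors, centroids, max_iters=100):
    """Lloyd's iteration: assign each vector to its nearest centroid, then
    recompute each centroid as its cluster's mean, until nothing changes."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Step 6: assign each vector to its closest centroid.
        clusters = [[] for _ in centroids]
        for v in vectors:
            dists = [math.dist(v, c) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Step 7: recompute each centroid as the mean of its cluster.
        # An empty cluster keeps its old centroid here -- this is exactly
        # the situation where a cluster can end up with no members.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        # Step 8: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Note the `if cluster else c` branch: the mean of an empty cluster is undefined, so an implementation has to decide what to do, and keeping the old centroid (as here) can leave that cluster empty forever.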

In the clustering results I got, some clusters have no members. For example, I set the number of clusters to 3, but only 2 clusters ended up with members and the third is empty.

Is this possible with the k-means algorithm? How can I solve this issue? Or is there perhaps a bug in my application?

Upvotes: 2

Views: 1743

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77454

Yes, k-means clusters can become empty.

This is more likely to happen with bad starting values or other programming errors, so if you see this frequently, I would still debug.
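As a tiny hypothetical illustration of how bad starting values cause this, consider three 1-D points and three poorly placed initial centroids; one centroid receives no points on the very first assignment step:

```python
# Three 1-D points and three badly chosen starting centroids.
points = [0.0, 1.0, 10.0]
centroids = [0.0, 0.5, 1.0]

# One assignment step: each point goes to its nearest centroid.
clusters = {i: [] for i in range(len(centroids))}
for p in points:
    nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    clusters[nearest].append(p)

# Point 0.0 is closest to centroid 0, while both 1.0 and 10.0 are closest
# to centroid 2 -- so centroid 1 ends up with an empty cluster.
```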

You also have one big conceptual error in your approach: nothing here makes the classes "positive", "negative", or "neutral". Those are supervised concepts; clustering cannot produce them.

Upvotes: 1

Gordon Linoff

Reputation: 1269483

Obviously, it is possible. If you ask for three clusters but have only two data points, then you will not get three clusters.

More practically, this seems highly suspicious. K-means usually starts with existing data points as the cluster centers, and each subsequent iteration uses the centroid of a bunch of points in the data. For a cluster to end up with no data points, the centroid computed from a set of points must not be the closest centroid to any of those points.

It might be possible to construct a scenario where this occurs, but it seems highly unlikely in a real-world example. Is it possible that you have some other filtering mechanism on the clusters -- such as a minimum size -- that is filtering out clusters? It is not at all uncommon for k-means to produce very small outlier clusters; in fact, I think it is best suited to finding those.
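When an empty cluster does occur, one common practical remedy (a heuristic of my own choosing here, not something either answer prescribes, and the function name is made up) is to re-seed the empty cluster's centroid, for example with the data point farthest from its currently assigned centroid:

```python
import math

def reseed_empty(clusters, centroids, vectors):
    """For every empty cluster, move its centroid onto the data point that is
    farthest from the centroid it is currently assigned to. Picking a random
    point, or splitting the largest cluster, are alternative heuristics."""
    def dist_to_own_centroid(v):
        owner = min(range(len(centroids)), key=lambda j: math.dist(v, centroids[j]))
        return math.dist(v, centroids[owner])

    for i, cluster in enumerate(clusters):
        if not cluster:
            centroids[i] = max(vectors, key=dist_to_own_centroid)
    return centroids
```

After re-seeding, the assignment and update steps are simply repeated; the outlier-like point now anchors its own cluster instead of leaving one empty.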

Upvotes: 2
