Reputation: 113
I have a text clustering project using the k-means algorithm. My dataset consists of Facebook comments on political topics, and each comment is labeled positive, negative, or neutral.
What I have done in my application program is as follows:
1. Normalize the text and remove stopwords.
2. Weight the terms using tf-idf.
3. Build a vector for each comment, forming a document-term matrix.
4. Set the number of clusters to 3 (matching the number of labels).
5. Choose the initial centroids randomly, one per label, so each label contributes one comment as a centroid.
6. Compute the distance from each comment's vector to each centroid and assign the comment to the closest centroid.
7. Recompute each centroid as the mean of the vectors in its cluster.
8. Repeat steps 6 and 7 until the centroids no longer change.
9. Output the final clusters.
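The steps above can be sketched with scikit-learn (an assumption — the question's own implementation is custom, and the comments here are hypothetical stand-ins for the Facebook data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical toy comments (placeholders for the real dataset).
comments = [
    "great policy, I support it",
    "terrible decision, very bad",
    "the meeting is on tuesday",
    "love this candidate",
    "awful corrupt politicians",
    "the vote happens next week",
]

# Steps 1-3: normalization, stopword removal, tf-idf document-term matrix.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(comments)

# Step 5: seed one centroid per label; here rows 0, 1, 2 stand in for
# one positive, one negative, and one neutral comment.
init_centroids = X[[0, 1, 2]].toarray()

# Steps 6-8: Lloyd iterations until the centroids stop moving.
km = KMeans(n_clusters=3, init=init_centroids, n_init=1)
labels = km.fit_predict(X)
print(labels)  # one cluster id (0, 1, or 2) per comment
```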
In my clustering results, some clusters have no members. For example, I set the number of clusters to 3, but only 2 clusters end up with members and the third is empty.
Is this possible with the k-means algorithm? How can I fix it, or is this likely a bug in my program?
Upvotes: 2
Views: 1743
Reputation: 77454
Yes, k-means clusters can become empty.
This is more likely to happen with bad starting values, but it can also be caused by programming errors, so if you see it frequently, I would still debug.
You have one big conceptual error in your approach, too: there is nothing here that would make the clusters correspond to "positive", "negative", or "neutral". These are supervised concepts; clustering cannot do this.
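To illustrate: k-means only produces arbitrary integer cluster IDs. If you also have gold labels, you can attach sentiment names to clusters only after the fact, for instance by majority vote (a post-hoc mapping, not something k-means itself does; the IDs and labels below are hypothetical):

```python
from collections import Counter

cluster_ids = [0, 0, 1, 2, 2, 1]                   # hypothetical k-means output
gold = ["pos", "pos", "neg", "neu", "neu", "neg"]  # hypothetical gold labels

# Majority-vote mapping from cluster id to the most common gold label in it.
mapping = {}
for c in set(cluster_ids):
    votes = [g for cid, g in zip(cluster_ids, gold) if cid == c]
    mapping[c] = Counter(votes).most_common(1)[0][0]
print(mapping)  # {0: 'pos', 1: 'neg', 2: 'neu'}
```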
Upvotes: 1
Reputation: 1269483
Obviously, it is possible. If you ask for three clusters but have only two data points, then you will not get three clusters.
More practically, this seems highly suspicious. K-means usually starts with existing data points as the cluster centers. Each subsequent iteration of k-means uses the centroid of a bunch of points in the data. For a cluster to become empty, the centroid computed from its points would have to not be the closest centroid to any of those points.
It might be possible to construct a scenario where this occurs. But it seems highly unlikely in a real-world example. Is it possible that you have some other filtering mechanism on the clusters -- such as a minimum size -- that is filtering out clusters? It is not at all uncommon for k-means to produce very small outlier clusters. In fact, I think it is best suited for finding those.
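One such constructed scenario, in one dimension: with poorly placed non-data starting centroids, an assignment step can leave a cluster with no points at all (the values here are contrived for illustration):

```python
import numpy as np

points = np.array([0.0, 1.0, 10.0, 11.0])
centroids = np.array([5.0, 5.1, 20.0])  # badly placed starting centroids

# One Lloyd assignment step: each point goes to its nearest centroid.
dists = np.abs(points[:, None] - centroids[None, :])
assignment = dists.argmin(axis=1)
print(assignment)       # [0 0 1 1] -- no point chooses centroid 2
print(2 in assignment)  # False: cluster 2 is empty
```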
Upvotes: 2