Reputation: 601
I have two clusters of points. Before I apply any clustering technique I know exactly what points should belong to each cluster, however the only way to label the data is with a clustering technique such as k-means. If the situation I am in seems puzzling don't focus on it, I am more interested in this potential specific issue with k-means.
Say my data looks like this (simple 2D points on x-y plane):
I want to get two clusters of points however there is a small problem. When I run a k-means algorithm I end up with something like this:
I should add this is just a sketched example.
The problem I am having is when the clusters have a vastly uneven number of points within them before the algorithm is run then it has a significant outcome on the algorithmic clustering at the end, to the point it obscures the data. Of course this is only an issue when the clusters are vaguely close together but I was wondering if there is a k-means variant or other clustering algorithm that handles different population sizes of clusters very well. I have tried to find such a thing but I fear I am using the wrong search terms such as "uneven k-means cluster populations" and similar phrasings only get me papers on faster k-means implementations and combinations with other statistical analysis.
Just to rest some concerns. I have run k-means several times and the result is always what was sketches above, with a cluster centroid between two visual clusters.
If this is just a disadvantage k-means has (and I can see it being so) then I can accept that.
Upvotes: 3
Views: 5079
Reputation: 12715
The output of K-Means algorithm depends a lot on the initial centroids that you choose. If you choose centroids that are close to one another then the clusters that you get will be skewed.
Moreover if the true clusters have unbalanced number of data points then by choosing the initial centroids randomly there is a high probability that you would choose the initial centroids from the same cluster.
Hence I would suggest that you try to choose the initial centroids that are as far apart as possible. This should be possible since your points are 2D.
You can even explore agglomerative clustering methods like Single Link or Complete Link Algorithms.
That said, these algorithm do not guarantee optimal results so you will have to be content with some sub-optimality.
Hope this helps.
Upvotes: 3