Reputation: 3004
I am running k-means clustering on a dataset with around 1 million items and around 100 attributes. I applied the clustering for various values of k, and I want to evaluate the different groupings with the silhouette score from the sklearn library. Running it with no sampling seems infeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)
I don't have a good sense of what an appropriate sampling approach is. Is there a rule of thumb for the ideal sample size to use given the size of my matrix? Is it better to take the largest sample size my analysis machine can handle, or should I average over several smaller samples?
My preliminary test with sample_size=10000 has produced some unintuitive results.
I am also open to alternative, more scalable evaluation metrics.
Edit, to visualize the issue:
The plot shows the silhouette score against the number of clusters.
In my opinion, it is normal behavior that increasing the sample size reduces the noise. But given that I have 1 million heterogeneous vectors, it seems unintuitive that 2 or 3 would be the "best" number of clusters. In other words, I did not expect to find a monotonic decrease in the silhouette score as I increase the number of clusters.
Upvotes: 16
Views: 22329
Reputation: 10366
I stumbled over the same unintuitive monotonic decrease in the silhouette score as the number of clusters increases.
However, taking a closer look at the equation for the mean intra-cluster distance a(i), you can see that it is only calculated over the other data points j in the same cluster as i. So with an increasing number of clusters, and thus an increasing number of clusters containing only a single sample, the silhouette value of those points defaults to 0, which pulls the overall silhouette score down.
The equation is from Wikipedia: https://en.wikipedia.org/wiki/Silhouette_(clustering)
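For reference, the definitions on that page are (with C_I the cluster containing point i and d the chosen distance metric):

$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$

$$s(i) = \begin{cases} \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} & \text{if } |C_I| > 1 \\ 0 & \text{if } |C_I| = 1 \end{cases}$$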
Upvotes: 0
Reputation: 69
Since there is no widely accepted best approach to determining the optimal number of clusters, all evaluation techniques, including the silhouette score, the gap statistic, etc., fundamentally rely on some form of heuristic or trial-and-error argument. So to me, the best approach is to try out multiple techniques and NOT to develop over-confidence in any single one.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.
One more thing is that the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times with the same sample size (say sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
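A minimal sketch of that repeatability check, assuming the feature_matrix and cluster_labels from the question; the sample size of 50000 is just the illustrative value above:

    import numpy as np
    from sklearn import metrics

    # feature_matrix (~1M x 100) and cluster_labels come from the fitted k-means run.
    scores = []
    for seed in range(5):
        scores.append(
            metrics.silhouette_score(
                feature_matrix,
                cluster_labels,
                metric='euclidean',
                sample_size=50000,   # illustrative; pick what your machine can handle
                random_state=seed,   # controls which rows are sampled each repetition
            )
        )

    # A small spread relative to the differences between candidate k values
    # suggests the sample size is large enough.
    print("mean={:.4f} std={:.4f}".format(np.mean(scores), np.std(scores)))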
Upvotes: 0
Reputation: 2011
k-means converges to a local minimum, so the starting positions play a crucial role in the optimal number of clusters. It is often a good idea to reduce the noise and the dimensionality with PCA (or another dimensionality-reduction technique) before proceeding with k-means.
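A minimal sketch of that preprocessing, assuming the feature_matrix from the question; the 50 components, 10 clusters and 10 restarts are illustrative choices, not prescriptions:

    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Reduce the ~100 noisy attributes to a smaller representation (50 is illustrative).
    reduced = PCA(n_components=50, random_state=0).fit_transform(feature_matrix)

    # n_init restarts k-means from several starting positions and keeps the best run,
    # which mitigates the sensitivity to starting positions mentioned above.
    cluster_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(reduced)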
Just to add, for the sake of completeness: it might be a good idea to get the optimal number of clusters by "partitioning around medoids" (PAM). It is equivalent to using the silhouette method.
The reason for the weird observations could be the different starting points for the different-sized samples.
Having said all of the above, it is important to evaluate the clusterability of the dataset at hand. A tractable means is the worst-pair ratio, as discussed here: Clusterability.
Upvotes: 1
Reputation: 1803
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if you have K == number of data points you can explain 100% of the variance; the question is where the improvements in variance explained start to level off.
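A minimal sketch of that computation, assuming the feature_matrix from the question is a dense array; KMeans's inertia_ (the within-cluster sum of squares) stands in for the unexplained variance, so % variance explained is 1 - inertia / total sum of squares:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.asarray(feature_matrix)               # assumed dense; ~1M x 100
    total_ss = ((X - X.mean(axis=0)) ** 2).sum() # total sum of squares

    # Fit k-means for a range of K and record the fraction of variance explained.
    for k in range(2, 21):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, 1.0 - inertia / total_ss)       # look for where this stops improving much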
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. the AICc just balances the increase in likelihood as you increase K against the increase in the number of parameters you need. In practice, all you do is choose the K that minimises the AICc.
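k-means itself does not give you a likelihood, so one possible stand-in (my assumption, not something this answer prescribes) is to fit a Gaussian mixture per candidate K and use its built-in criteria; fitting on a random subsample keeps this tractable at your scale:

    from sklearn.mixture import GaussianMixture

    bic_per_k = {}
    for k in range(2, 21):
        gm = GaussianMixture(n_components=k, covariance_type='diag', random_state=0)
        gm.fit(feature_matrix)                  # or fit on a random subsample for speed
        bic_per_k[k] = gm.bic(feature_matrix)   # gm.aic(...) works the same way

    print(min(bic_per_k, key=bic_per_k.get))    # the K that minimises the criterion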
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, such as DBSCAN. I haven't seen this approach used to estimate K, though, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there is likely something about your data that you are not appreciating (i.e. there are not as many clusters as you are expecting).
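A minimal sketch of that sanity check; eps and min_samples are illustrative values that would need tuning for your data, and DBSCAN on the full 1 million rows may be slow, so a subsample is reasonable:

    from sklearn.cluster import DBSCAN

    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(feature_matrix)

    # DBSCAN labels noise points as -1, so exclude that label when counting clusters.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters)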
How much to sample
It looks like you've answered this from your plot: no matter what sampling you use, you get the same pattern in the silhouette score, so that pattern seems very robust to the sampling assumptions.
Upvotes: 7