Reputation: 21
When clustering a large dataset with KMeans, we can find the optimal value of K with the following code snippet:
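from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# min_value and max_value bound the candidate K values to try
model = KMeans()
visualizer = KElbowVisualizer(model, k=(min_value, max_value), timings=False, locate_elbow=True)
visualizer.fit(data)  # fits one KMeans model per candidate K
no_of_clusters = visualizer.elbow_value_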
Here we specify the range (min_value and max_value) in which the K value should lie. For large datasets (e.g. 1 million rows), how do we choose this range well so that we save a lot of execution time?
Upvotes: 1
Views: 391
Reputation: 77454
Subsample your data.
K-means is based on means. The precision of a mean improves only with the square root of the sample size, so it does not get much better with more data. Just use 10k objects; that is enough.
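A minimal sketch of this subsampling idea, assuming the data fits in a NumPy array. The helper names `subsample_rows` and `inertia_by_k` and the synthetic `big_data` are made up for illustration, and the inertia loop is a plain scikit-learn stand-in for the elbow search, not the visualizer from the question:

```python
import numpy as np
from sklearn.cluster import KMeans


def subsample_rows(data, n_samples=10_000, seed=0):
    """Return a uniform random subset of rows, drawn without replacement."""
    data = np.asarray(data)
    if len(data) <= n_samples:
        return data
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=n_samples, replace=False)
    return data[idx]


def inertia_by_k(data, k_values, seed=0):
    """Fit KMeans once per k and return {k: inertia} for an elbow plot."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_
        for k in k_values
    }


# Synthetic stand-in for a large dataset: 3 well-separated blobs in 2-D.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
big_data = np.vstack([c + rng.normal(size=(50_000, 2)) for c in centers])

# Search for K on 10k rows instead of all 150k.
sample = subsample_rows(big_data, n_samples=10_000)
inertias = inertia_by_k(sample, k_values=range(2, 7))
```

The same `sample` could be passed to `visualizer.fit(...)` from the question in place of the full data. Once K is chosen, the final model can still be fitted on all rows.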
Upvotes: 0
Reputation: 3
Before answering: data science is intuition combined with trial and error. We don't get the solution in one shot. Use the silhouette score to evaluate candidate values of K. Or try 3 values at a time, compute their silhouette scores, and see whether any of them gives a good score.
Upvotes: 1
Reputation: 482
Good question on how to arrive at a sensible range for K. There are a couple of scenarios your problem may fall into.
Scenario 1: We know the business context, i.e. how the result will be used. Say we are trying to group countries into clusters: developed, developing, and under-developed countries. Here the approximate range of K is driven by the business need. In that case, you might widen the range by a couple of clusters on either side.
Scenario 2: We have little idea of how the clusters will be used. In such cases, you may compute a metric like the silhouette score for each value of K and pick the value with the maximum silhouette score. A small tweak is to increase K in steps of 2 or 3 rather than 1 when the number of clusters you are looking at is in the order of tens.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
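Scenario 2 can be sketched roughly as follows: scan K in steps of 2, then refine around the best coarse value. The function name `best_k_by_silhouette` and the synthetic data are assumptions for illustration. The `sample_size` argument of `silhouette_score` scores a random subset of rows, which helps on large data because the full score needs all pairwise distances:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(data, k_values, sample_size=2_000, seed=0):
    """Return (best_k, {k: score}) using the silhouette score on a row sample."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
        scores[k] = silhouette_score(
            data, labels, sample_size=sample_size, random_state=seed
        )
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic data: 4 well-separated blobs at the corners of a square.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
data = np.vstack([c + rng.normal(size=(1_000, 2)) for c in centers])

# Coarse pass: K = 2, 4, 6, ..., 12 (step of 2 instead of 1).
coarse_best, coarse_scores = best_k_by_silhouette(data, range(2, 13, 2))

# Fine pass: check the immediate neighbours of the coarse winner.
fine_range = range(max(2, coarse_best - 1), coarse_best + 2)
refined_best, fine_scores = best_k_by_silhouette(data, fine_range)
```

The coarse pass fits 6 models instead of 11, and the fine pass adds only 3 more. On real data the silhouette curve can be flatter than in this toy case, so the coarse step may need to be smaller.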
Upvotes: 1
Reputation: 39
Deciding the number of clusters mainly depends on your application.
In my case, I use the following values:
1. For small data and a less critical application: Kmin = 2 and Kmax = 10
2. For small data and a more critical application: Kmin = 2 and Kmax = up to 20
3. For large data and a less critical application: Kmin = 2 and Kmax = between 5 and 10
4. For large data and a more critical application: Kmin = 2 and Kmax = between 10 and 15
In any case, don't go beyond 30.
Upvotes: 0