How to set the range of K while finding its optimum value?

KMeans can cluster large datasets. To find the optimal value of K, we can use the following code snippet:

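from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
# k=(min_value, max_value) is the range of K values to try
visualizer = KElbowVisualizer(model, k=(min_value, max_value), timings=False, locate_elbow=True)
visualizer.fit(data)
no_of_clusters = visualizer.elbow_value_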

Here we specify the range (min_value and max_value) in which to search for K. For large datasets (for example, 1 million rows), how do we choose this range so that we save as much execution time as possible?

Upvotes: 1

Answers (4)

Has QUIT--Anony-Mousse

Reputation: 77454

Subsample your data.

K-means is based on means. The precision of the means does not improve much with more data. So just use 10k objects, that is enough.
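As a rough illustration of the subsampling idea, here is a minimal sketch. It assumes the data is a NumPy array and uses synthetic data as a stand-in for the real 1-million-row dataset; the names `data`, `sample`, and `sample_size` are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a large dataset: 1,000,000 rows, 2 features (assumption).
data = rng.normal(loc=[5.0, -3.0], scale=2.0, size=(1_000_000, 2))

# Draw 10,000 rows uniformly at random, without replacement.
sample_size = 10_000
idx = rng.choice(data.shape[0], size=sample_size, replace=False)
sample = data[idx]

# The sample mean is already close to the full-data mean: the standard
# error shrinks only as 1/sqrt(n), so extra rows add little precision.
max_mean_gap = np.abs(sample.mean(axis=0) - data.mean(axis=0)).max()
print(sample.shape, max_mean_gap)
```

The subsample can then be passed to `visualizer.fit(sample)` in place of the full `data`, and the chosen K used to fit the final model on all rows if needed.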

Upvotes: 0

Raghuvaran M

Reputation: 3

Before answering: data science is intuition combined with trial and error. We don't get the solution in one shot. Use the silhouette score to evaluate candidate values of K. Alternatively, try 3 values at a time, compute their silhouette scores, and see whether any of them gives a good score.

Upvotes: 1

praneeth

Reputation: 482

Good question on how to arrive at a sensible range for K. There are a couple of scenarios your problem may fall into.

Scenario 1: We know the business context, i.e. how the result will be used. Say we are trying to group countries into clusters: developing, developed, and under-developed countries. Here the approximate range of values is driven by the business. In that case, you might widen that range by a couple of clusters.

Scenario 2: We have little idea of the business use of the clusters. In such cases, you may compute a metric such as the silhouette score for each value of K and stop at the value where you find the maximum silhouette score. A small tweak is to increase K by 2 or 3 rather than 1 each time when the number of clusters you are looking at is in the tens.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
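A minimal sketch of Scenario 2, assuming scikit-learn is available. The synthetic blobs, the step of 2, and the `sample_size=1000` argument (which makes `silhouette_score` evaluate on a random subset to save time) are illustrative choices, not part of the original answer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (assumption, for illustration only).
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=3000, centers=centers, cluster_std=0.6, random_state=42)

scores = {}
# Step through K in increments of 2 instead of 1 to cut the number of fits.
for k in range(2, 13, 2):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # sample_size scores a random subset of points, which is cheaper on large data.
    scores[k] = silhouette_score(X, labels, sample_size=1000, random_state=42)

# Pick the K with the highest silhouette score on the coarse grid.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Once the coarse grid points to a region, the neighbouring values of K (here `best_k - 1` and `best_k + 1`) can be checked with a step of 1.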

Upvotes: 1

vineagle

Reputation: 39

Actually, deciding the number of clusters mainly depends on your application.

But in my case I use the following values:

1. For small data and a less critical application: Kmin = 2 and Kmax = 10
2. For small data and a more critical application: Kmin = 2 and Kmax up to 20
3. For large data and a less critical application: Kmin = 2 and Kmax between 5 and 10
4. For large data and a more critical application: Kmin = 2 and Kmax between 10 and 15

In any case, don't go beyond 30.

Upvotes: 0
