Metrics to estimate K in Kmeans

Question

I'm trying to estimate the number of clusters K in K-means using the elbow and BIC methods. X is a multidimensional array of data points (100,000 data points × 100 features).
Here is the code I use for the elbow method:

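from sklearn.cluster import MiniBatchKMeans
import matplotlib.pyplot as plt

Ks = [40, 50, 60, 70, 80, 90, 100, 110, 120]
ds = []
for K in Ks:
    cls = MiniBatchKMeans(n_clusters=K, batch_size=1000, random_state=101)
    # Feed the data to the model in chunks of 1000 rows
    for i in range(0, len(X), 1000):
        chunk = X[i:i + 1000]
        cls.partial_fit(chunk)
    # Record the distortion (inertia) reported by the fitted model
    ds.append(cls.inertia_)
plt.plot(Ks, ds)
plt.xlabel('Value of K')
plt.ylabel('Distortion')
plt.show()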

The code I use for BIC comes from here, by Prabhath Nanisetty.

Here are the plots I'm getting from each of these methods:

What is the right K value to use? Are these the right metrics to use for my dataset, based on these results? Thank you.

Antimony · Accepted Answer

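I think your dataset has far too many dimensions and risks suffering from the curse of dimensionality: with 100 features, distances between points tend to become less informative, and K-means relies entirely on those distances.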

But to answer your question: going by the elbow method, it seems like K = 90. With the BIC method you look for the highest value (according to that particular implementation; some implementations reverse the sign). This is a bit more ambiguous, but it appears that after K = 60 all values of K perform almost equally well.
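If you would rather not pick the elbow by eye, one common heuristic is to take the point on the curve that lies farthest from the straight line joining its first and last points. Here is a minimal sketch of that idea. `find_elbow` is a hypothetical helper of mine, and the distortion values below are made up for illustration; they are not read off the plots in the question.

```python
import math

def find_elbow(ks, distortions):
    """Return the k whose point lies farthest from the chord joining the curve's endpoints."""
    # Rescale both axes to [0, 1] so neither one dominates the distance
    k_min, k_max = ks[0], ks[-1]
    d_min, d_max = min(distortions), max(distortions)
    xs = [(k - k_min) / (k_max - k_min) for k in ks]
    ys = [(d - d_min) / (d_max - d_min) for d in distortions]

    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord_length = math.hypot(x1 - x0, y1 - y0)

    best_k, best_dist = ks[0], -1.0
    for k, x, y in zip(ks, xs, ys):
        # Perpendicular distance from (x, y) to the chord
        dist = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / chord_length
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k

# Made-up distortions: a steep linear drop up to K = 90, then nearly flat
example_ks = [40, 50, 60, 70, 80, 90, 100, 110, 120]
example_ds = [1000, 880, 760, 640, 520, 400, 390, 380, 370]
print(find_elbow(example_ks, example_ds))  # prints 90
```

This only formalises the visual judgment; on a noisy or smoothly decaying curve the "farthest point" can shift with the range of K you scan, so treat the result as a hint rather than a verdict.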

You can also take a look at this article on the same topic. It introduces another method to estimate K, the gap statistic. I'd suggest running one more metric to break ties and then selecting the K chosen by at least two of the three metrics.
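For reference, the gap statistic compares log(W_k), the log of the within-cluster sum of squares on your data, with its average over reference datasets drawn uniformly from the data's bounding box. A rough sketch, using only NumPy and a plain Lloyd's K-means rather than `MiniBatchKMeans`, might look like this. The function names, `n_refs`, and `n_iter` are my own choices, not taken from the linked article.

```python
import numpy as np

def kmeans_inertia(X, k, rng, n_iter=20):
    """Run plain Lloyd's k-means and return the within-cluster sum of squares."""
    # Initialise centers from k distinct data points (fancy indexing makes a copy)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def gap_statistic(X, ks, n_refs=5, seed=0):
    """Return Gap(k) = mean(log W_k on uniform references) - log W_k on X, for each k."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in ks:
        log_w = np.log(kmeans_inertia(X, k, rng))
        ref_log_w = [
            np.log(kmeans_inertia(rng.uniform(lo, hi, size=X.shape), k, rng))
            for _ in range(n_refs)
        ]
        gaps.append(np.mean(ref_log_w) - log_w)
    return np.array(gaps)
```

Larger gaps suggest more structure than uniform noise would show. The usual selection rule also uses the spread of the reference values to pick the smallest acceptable K, which this sketch leaves out. The distance computation here builds an n × k × d array, so for 100,000 × 100 data you would want to swap in `MiniBatchKMeans` or work on a subsample.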
