Regina

Reputation: 115

Metrics to estimate K in Kmeans

I'm trying to estimate the number of clusters K for K-means using the elbow and BIC methods. X is a multidimensional array of data points (100,000 data points × 100 features).
Here is the code I use for the elbow method:

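import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

Ks = [40, 50, 60, 70, 80, 90, 100, 110, 120]
batch_size = 1000
ds = []  # distortion (inertia) recorded for each K
for K in Ks:
    cls = MiniBatchKMeans(n_clusters=K, batch_size=batch_size, random_state=101)
    # Feed X (the 100000 x 100 array described above) to the model chunk by chunk
    for i in range(0, len(X), batch_size):
        chunk = X[i:i + batch_size]
        cls.partial_fit(chunk)
    # Note: after partial_fit, inertia_ appears to describe only the last chunk
    ds.append(cls.inertia_)
plt.plot(Ks, ds)
plt.xlabel('Value of K')
plt.ylabel('Distortion')
plt.show()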

The code I use for BIC comes from here, by Prabhath Nanisetty.

Here are the plots I'm getting with each of these methods:

[Figure: Elbow method]
[Figure: BIC method]

What is the right K value to use? Based on these results, are these the right metrics to use for my dataset? Thank you.

Upvotes: 0

Views: 382

Answers (1)

Antimony

Reputation: 2240

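I think your dataset has too many dimensions (100 features) and risks suffering from the curse of dimensionality: in high-dimensional spaces, the distances between points tend to become nearly equal, which makes clusters harder to separate.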
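To illustrate what that means for a distance-based method like K-means, here is a minimal sketch using only the standard library. It draws uniform random points (synthetic data, not your dataset) and measures how much the nearest and farthest distances from a query point differ. The function name `relative_contrast` and the sample sizes are my own choices for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over the Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Synthetic illustration only: the contrast shrinks as the dimension grows
for dim in (2, 10, 100):
    print(dim, round(relative_contrast(dim), 2))
```

If the contrast on your real data is similarly small, reducing the dimensionality before clustering may be worth trying.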

But to answer your question: going by the elbow method, K = 90 looks like the best choice. With the BIC method you look for the highest value (according to that particular implementation; some implementations reverse the signs). The BIC plot is more ambiguous, but it appears that all values after K = 60 perform almost equally well.
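If you would rather not read the elbow off the plot by eye, one common heuristic is to pick the point farthest from the straight line joining the first and last points of the curve. Below is a rough sketch of that idea plus the "highest BIC" rule. The helper names `find_elbow` and `best_by_bic` are hypothetical, and the distortion and BIC numbers are made up for illustration; they are not read from your plots.

```python
import math


def find_elbow(ks, distortions):
    """Return the K whose (K, distortion) point lies farthest from the
    straight line joining the first and last points of the curve."""
    x_min, x_max = min(ks), max(ks)
    y_min, y_max = min(distortions), max(distortions)
    if x_max == x_min or y_max == y_min:
        raise ValueError("curve is flat; no elbow to find")
    # Rescale both axes to [0, 1] so neither dominates the distance
    xs = [(k - x_min) / (x_max - x_min) for k in ks]
    ys = [(d - y_min) / (y_max - y_min) for d in distortions]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    best_k, best_dist = ks[0], -1.0
    for k, x, y in zip(ks, xs, ys):
        # Perpendicular distance from (x, y) to the endpoint line
        dist = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


def best_by_bic(ks, bics, higher_is_better=True):
    """Return the K with the best BIC; flip higher_is_better if your
    implementation uses the opposite sign convention."""
    pairs = list(zip(bics, ks))
    return max(pairs)[1] if higher_is_better else min(pairs)[1]


# Made-up numbers, for illustration only
ks = [40, 50, 60, 70, 80, 90, 100, 110, 120]
distortions = [1000, 900, 800, 700, 600, 500, 480, 470, 465]
bics = [-520, -480, -410, -405, -404, -400, -402, -403, -406]
print(find_elbow(ks, distortions), best_by_bic(ks, bics))
```

Keep in mind that when the BIC curve is nearly flat, as it seems to be after K = 60, the maximum is not very meaningful on its own.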

You can also take a look at this article on the same topic. It introduces another method to estimate K, the Gap method. I'd suggest running one more metric to break ties and then selecting the K chosen by at least 2 of the 3 metrics.
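The "2 out of 3" rule can be written down in a few lines. This is only a sketch: `pick_k_by_vote` is a hypothetical helper, and the BIC and Gap values below are placeholders, not results from your data.

```python
from collections import Counter


def pick_k_by_vote(suggestions, min_votes=2):
    """Return the K proposed by at least min_votes metrics, or None
    if no K reaches that many votes."""
    counts = Counter(suggestions.values())
    k, votes = counts.most_common(1)[0]
    return k if votes >= min_votes else None


# Placeholder values: only the elbow estimate comes from the discussion above
suggestions = {"elbow": 90, "bic": 90, "gap": 70}
print(pick_k_by_vote(suggestions))
```

If all three metrics disagree, the function returns None, which is a sign that you need a different criterion or a closer look at the data.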

Upvotes: 2
