YNR

Reputation: 875

Interpreting clustering metrics

I'm clustering 398 samples with 306 features using k-means in scikit-learn. The feature matrix is sparse, and the number of clusters is 4. To improve the clustering, I tried two approaches:

  1. After clustering, I used ExtraTreesClassifier() to classify the samples (labeled by their cluster assignments) and compute feature importances.

  2. I used PCA to reduce the feature dimension to 2.

I then computed the following metrics (sum of squares, Calinski-Harabasz, Silhouette):

        Method                   sum_of_squares   Calinski_Harabasz   Silhouette
    
        1 kmeans                 31.682           401.3               0.879
        2 kmeans+top-features    5989230.351      75863584.45         0.977
        3 kmeans+PCA             890.5431893      58479.00277         0.993
    

My questions are:

  1. As far as I know, a smaller sum of squares means better clustering, and so does a Silhouette closer to 1. Yet in the last row, for instance, both the sum of squares and the Silhouette are higher than in the first row. How should I interpret this?
  2. How can I choose which approach performs better?

Upvotes: 1

Views: 185

Answers (2)

lanenok

Reputation: 2749

With 306 features you are under the curse of dimensionality. Clustering in 306 dimensions is not meaningful. Therefore I wouldn't select features after clustering.

To get interpretable results, you need to reduce the dimensionality. For 398 samples you need a low dimension (2, 3, maybe 4). Your PCA with dimension 2 is good. You can also try 3.

Selecting important features before clustering may be problematic. Anyway, are the 2, 3 or 4 "best" features meaningful in your case?

Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77474

Never compare sum-of-squares and similar metrics across different projections, transformations or data sets.

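To see why, simply multiply every feature by 0.5: your SSQ will drop to 0.25 of its original value (a factor of 4), because every squared distance scales with the square of the factor. So to "improve" your data set, you just need to scale it to a tiny size...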
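The scaling argument can be checked by hand. The following is only a minimal sketch in plain Python: the four points, their cluster labels and the helper names `centroid` and `within_cluster_ssq` are made up for illustration and are not taken from the question's data.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate lists."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def within_cluster_ssq(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum((x - cx) ** 2 for p in members for x, cx in zip(p, c))
    return total


# Hypothetical toy data: two clusters of two points each.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 14.0]]
labels = [0, 0, 1, 1]

# Same points, same cluster assignment, every feature multiplied by 0.5.
scaled = [[0.5 * x for x in p] for p in points]

ssq_original = within_cluster_ssq(points, labels)
ssq_scaled = within_cluster_ssq(scaled, labels)
ratio = ssq_scaled / ssq_original

print(ssq_original, ssq_scaled, ratio)
```

The cluster assignment is identical in both cases, so the clustering is neither better nor worse; only the units changed, and the SSQ changed with them.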

These metrics must only be compared on the exact same input and parameters. You can't even use sum-of-squares to compare k-means with different k, because the larger k will win. All you can do is run multiple random attempts and keep the best minimum you found this way.
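Both points can be illustrated with a toy k-means written in plain Python. This is a sketch under stated assumptions, not scikit-learn's implementation: the synthetic blobs, the seeds and the helper names `kmeans_once` and `best_of_restarts` are hypothetical. In scikit-learn itself, the restarts correspond to the `n_init` parameter of `KMeans`, and the sum of squares is exposed as `inertia_`.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, n_iter=20):
    """One run of Lloyd's algorithm from a random start; returns the final SSQ."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run several random starts on the SAME data and k; keep the lowest SSQ."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(n_restarts))


# Hypothetical data: three well-separated 2-D blobs of 10 points each.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 8.0)]
points = [
    [cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5)]
    for cx, cy in blob_centers
    for _ in range(10)
]

# SSQ keeps shrinking as k grows and reaches zero when every point is its own cluster.
n = len(points)
ssq_by_k = {k: best_of_restarts(points, k) for k in (1, 3, n)}
for k, ssq in ssq_by_k.items():
    print(k, ssq)
```

With k equal to the number of points the SSQ is zero, which is the "best" possible value and a useless clustering. Comparing SSQ is only fair between restarts that share the same data and the same k, which is what `best_of_restarts` does.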

Upvotes: 2
