Interpreting clustering metrics

Question

I'm running k-means in scikit-learn on a data set of 398 samples and 306 features. The feature matrix is sparse, and the number of clusters is 4. To improve the clustering, I tried two approaches:

First, after clustering, I used ExtraTreesClassifier() to classify the samples (labeled by their cluster assignments) and to compute feature importances.

Second, I used PCA to reduce the feature dimension to 2.

I then computed the following metrics: sum of squares (SS), Calinski-Harabasz (CH) and Silhouette (SH).

    #  Method                 Sum of squares   Calinski-Harabasz   Silhouette
    1  kmeans                         31.682               401.3        0.879
    2  kmeans+top-features       5989230.351         75863584.45        0.977
    3  kmeans+PCA                890.5431893         58479.00277        0.993

My question is:

As far as I know, a smaller sum of squares means the clustering is better, and a Silhouette closer to 1 also means the clustering is better. Yet in the last row, both the sum of squares and the Silhouette are higher than in the first row.
How can I choose which approach performs better?

Has QUIT--Anony-Mousse · Accepted Answer

Never compare sum-of-squares and similar metrics across different projections, transformations or data sets.

To see why, simply multiply every feature by 0.5: your SSQ shrinks to 0.25 of its original value, even though the clustering itself has not changed at all. So to "improve" your data set, you just need to scale it to a tiny size...
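The scaling argument is easy to check numerically. The snippet below is only a sketch: it uses synthetic data from make_blobs as a stand-in for the real matrix (the 10 features and the seeds are arbitrary assumptions), and within_cluster_ssq is a hypothetical helper written for this illustration, not a scikit-learn function. It keeps the cluster labels fixed and only shrinks the coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def within_cluster_ssq(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    return sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )


# Synthetic stand-in for the real data (assumed shape, not the asker's matrix).
X, _ = make_blobs(n_samples=398, n_features=10, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Same points, same partition, every feature multiplied by 0.5.
X_half = 0.5 * X

ssq = within_cluster_ssq(X, labels)
ssq_half = within_cluster_ssq(X_half, labels)
sil = silhouette_score(X, labels)
sil_half = silhouette_score(X_half, labels)

print(f"SSQ ratio (scaled / original): {ssq_half / ssq:.2f}")
print(f"Silhouette original: {sil:.3f}, scaled: {sil_half:.3f}")
```

The SSQ ratio comes out at 0.25 although the partition is identical. Silhouette happens to survive a uniform rescaling like this one, but feature selection and PCA are not uniform rescalings, so that does not make the three rows in the question comparable.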

These metrics must only be used on the exact same input and parameters. You can't even use sum-of-squares to compare k-means runs with different k, because the larger k will win. All you can do is run multiple random attempts and keep the best minimum you find this way.
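Both points can be sketched in scikit-learn, again on synthetic make_blobs data with arbitrary seeds (assumptions for illustration only). The first part compares runs that differ only in their random initialisation, which is the one comparison where SSQ (exposed as the inertia_ attribute) is meaningful. The second part shows the SSQ falling as k grows on the same data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data (assumed shape, not the asker's matrix).
X, _ = make_blobs(n_samples=398, n_features=10, centers=4, random_state=0)

# Same data, same k: several random initialisations, keep the lowest SSQ.
runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]
best = min(runs, key=lambda km: km.inertia_)
print(f"best SSQ over {len(runs)} restarts: {best.inertia_:.1f}")

# Same data, different k: the SSQ keeps shrinking, so it cannot pick k for you.
inertia_by_k = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in (2, 4, 8, 16)
}
for k, inertia in inertia_by_k.items():
    print(f"k={k:2d}  SSQ={inertia:.1f}")
```

In practice the n_init parameter of KMeans already does the restart-and-keep-best step internally, so the explicit loop is only there to make the idea visible.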
