Reputation: 875
I'm doing k-means clustering in scikit-learn on 398 samples with 306 features. The feature matrix is sparse, and the number of clusters is 4. To improve the clustering, I tried two approaches:
1. After clustering, I used ExtraTreesClassifier() to classify and compute feature importances (using the cluster assignments as labels).
2. I used PCA to reduce the feature dimension to 2.
For each method I computed the following metrics (sum of squares, Calinski-Harabasz, Silhouette); a rough sketch of the whole pipeline is shown after the table.
  Method               sum_of_squares   Calinski_Harabasz   Silhouette
1 kmeans                       31.682               401.3        0.879
2 kmeans+top-features     5989230.351         75863584.45        0.977
3 kmeans+PCA              890.5431893         58479.00277        0.993
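The post doesn't include code, so here is a minimal sketch of what such a pipeline might look like in scikit-learn. The variable X, the top-10 feature cutoff, and the use of inertia_ as the sum-of-squares value are assumptions, not taken from the post:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# X is assumed to be the sparse 398 x 306 feature matrix (scipy sparse);
# PCA needs a dense array, and 398 x 306 is small enough to densify.
X_dense = X.toarray()

# Plain k-means with 4 clusters
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X_dense)

# Approach 1: rank features with ExtraTreesClassifier trained on the cluster
# labels, then cluster again on the highest-ranked features
# (the top-10 cutoff is an arbitrary choice here)
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_dense, labels)
top = np.argsort(forest.feature_importances_)[::-1][:10]
labels_top = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_dense[:, top])

# Approach 2: project to 2 principal components, then cluster
X_pca = PCA(n_components=2).fit_transform(X_dense)
km_pca = KMeans(n_clusters=4, n_init=10, random_state=0)
labels_pca = km_pca.fit_predict(X_pca)

# Metrics for one variant: inertia_ is the within-cluster sum of squares
print(km_pca.inertia_,
      calinski_harabasz_score(X_pca, labels_pca),
      silhouette_score(X_pca, labels_pca))
```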
My questions are: which of these approaches actually gives the better clustering, and can I compare these metric values across the three methods?
Upvotes: 1
Views: 185
Reputation: 2749
With 306 features you are under the curse of dimensionality. Clustering in 306 dimensions is not meaningful. Therefore I wouldn't select features after clustering.
To get interpretable results, you need to reduce the dimensionality. With only 398 samples you need a low dimension (2, 3, maybe 4). Your PCA with 2 components is a good choice; you can also try 3.
Selecting important features before clustering may also be problematic. In any case, are the 2/3/4 "best" features actually meaningful in your case?
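A minimal sketch of this reduce-then-cluster idea, assuming X is the dense 398x306 matrix (the component counts are the 2 and 3 suggested above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project to a handful of principal components first, then cluster in that space.
for n_components in (2, 3):
    X_low = PCA(n_components=n_components).fit_transform(X)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)
    print(n_components, np.bincount(labels))  # cluster sizes for each projection
```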
Upvotes: 1
Reputation: 77474
Never compare sum-of-squares and similar metrics across different projections, transformations or data sets.
To see why, simply multiply every feature by 0.5 - your SSQ will drop to a quarter of its previous value. So to "improve" your data set, you just need to scale it down to a tiny size...
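A tiny synthetic illustration of this effect (random data, not the question's matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 5)

ssq = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).inertia_
ssq_half = KMeans(n_clusters=4, n_init=10, random_state=0).fit(0.5 * X).inertia_

# Same data, same cluster structure - the score "improves" only because of the scaling.
print(ssq_half / ssq)  # ~0.25
```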
These metrics must only be used to compare results on the exact same input and parameters. You can't even use sum-of-squares to compare k-means runs with different k, because the larger k will win. All you can do is run multiple random attempts, and then keep the best minimum you found this way.
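In scikit-learn those repeated random attempts are what the n_init parameter does; a sketch, where X stands for one fixed input that all compared runs must share:

```python
from sklearn.cluster import KMeans

# n_init restarts k-means from different random initializations and
# keeps the run with the lowest within-cluster sum of squares.
km = KMeans(n_clusters=4, n_init=50, random_state=0).fit(X)
print(km.inertia_)  # best (lowest) SSQ among the 50 attempts, for this exact input and k
```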
Upvotes: 2