Reputation: 2527
Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.
Using PCA on the Iris dataset(with the data in the csv ordered such that all of the first class are listed, then the second, then the third) yields the following plot:-
It can be seen that the three classes in the Iris dataset have been retained. However, when the order of the samples is randomised, the following plot is produced:-
Above, it is not clear how many clusters/classes are contained in the data set. In this case(the more real world case), how would one identify the number of classes, would a clustering algorithm such as K-Means be effective?
Would there be innacuracies due to the discarding of lower order Principal Components?
EDIT:- To be clear, I am asking if a dataset can be clustered after running PCA, and if so, what the most accurate method would be.
Upvotes: 0
Views: 3074
Reputation: 77454
Try sorting the dataset after PCA, then plotting it.
The iris data set is much to simple to draw any valid conclusions about the behaviour of high-dimensional data, and the benefits of PCA.
Plus, "wise" - in which sense? If you want to eat pizza, it is not wise to plot the iris data set.
Upvotes: 0
Reputation: 14031
Say we have a dataset of a large dimension, which we have reduced to a lower dimension using PCA, would it be wise/accurate to then use a clustering algorithm on said data? Assuming that we do not know how many clusters to expect.
Your data might well separate in a low-variance dimension. I would not recommend running PCA prior to clustering.
Above, it is not clear how many clusters/classes are contained in the data set. In this case(the more real world case), how would one identify the number of classes, would a clustering algorithm such as K-Means be effective?
There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
Upvotes: 1