girl101

Reputation: 141

Which features were extracted after PCA?

I am trying to extract features using PCA. I got some MATLAB code from StackExchange, given below, that selects the top 100 most relevant features after PCA. When I executed the code, I found that the eigenvalues were already sorted after the first statement. Why do we need to sort them again in descending order?

% princomp returns the eigenvectors, the projected data and the eigenvalues
[eigenvectors, projected_data, eigenvalues] = princomp(proteingene);
% order the eigenvalues in descending order and keep the top 100 projections
[foo, feature_idx] = sort(eigenvalues, 'descend');
selected_projected_data = projected_data(:, feature_idx(1:100));

Another question: for my feature set, the projected_data variable has columns (features) whose values are 0. This means that these features do not have much significance. Am I right?

My last question is: how would I know which features were extracted by the PCA?

Upvotes: 1

Views: 323

Answers (1)

Daniel

Reputation: 12026

I found that the eigenvalues were already sorted after the first statement. Why do we need to sort them again in descending order?

You don't need to sort again, since princomp already returns the eigenvalues sorted in descending order. However, princomp is now deprecated, so you should use the built-in function pca instead.
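A minimal sketch of the equivalent call using pca (assuming the same proteingene matrix; the variable names just mirror your snippet):

% pca returns the principal component coefficients (eigenvectors), the
% projected data (scores) and the eigenvalues (latent), already sorted
% in descending order of eigenvalue
[eigenvectors, projected_data, eigenvalues] = pca(proteingene);
% no extra sort needed: the first 100 columns are already the top 100
selected_projected_data = projected_data(:, 1:100);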

I can only speculate that the author of the code included the call to sort for completeness, so it is clear what he is doing (as sort returns feature_idx). The following code snippet, for example, achieves the same result, but is not as clear for the reader:

[eigenvectors, projected_data, eigenvalues] = princomp(proteingene);
selected_projected_data = projected_data(:, 1:100);

For someone learning PCA, what does 1:100 mean?

Another question: for my feature set, the projected_data variable has columns (features) whose values are 0. This means that these features do not have much significance. Am I right?

I think a safe answer is that a feature doesn't have much significance if the corresponding eigenvalue is close to 0. In the case of the code you posted, you can check that by looking at eigenvalues(feature_idx). Columns (features) of projected_data whose entries are 0 simply mean that your data doesn't span that dimension of the space; you can think of them as the zero components of a vector in a standard real vector space.
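As a rough sketch of that check (reusing the variables from your snippet; the 1% threshold is only an illustrative assumption), you could keep the components that explain a non-negligible share of the total variance instead of a fixed 100:

% fraction of the total variance explained by each principal component
explained = eigenvalues / sum(eigenvalues);
% keep the components that explain more than, say, 1% of the variance
significant_idx = find(explained > 0.01);
selected_projected_data = projected_data(:, significant_idx);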

My last question is: how would I know which features were extracted by the PCA?

Those are given in projected_data! That variable contains your data projected along the directions of the eigenvectors. Feature extraction is really an interpretation on top of the PCA decomposition: PCA doesn't "extract" any features, it simply changes the basis that describes your data (you can see a visual explanation here), the new basis being the eigenvectors of the covariance matrix of proteingene. To "extract" features, you have to decide which columns of projected_data are relevant for you. In the code example, 100 features were "extracted" arbitrarily, without any criterion judging their significance for the particular problem at hand.

selected_projected_data = projected_data(:, feature_idx(1:100));

In fact, if your data proteingene has fewer than 100 dimensions, you'd even get an error trying to extract 100 features from it using PCA.
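If by "features" you actually mean the original columns of proteingene, one possible sketch (again reusing the variable names from your snippet, and assuming proteingene has at least a handful of columns) is to inspect the loadings in eigenvectors: each column tells you how strongly each original variable contributes to that principal component.

% each column of eigenvectors expresses one principal component as
% weights on the original variables (the columns of proteingene)
for k = 1:3                        % look at the first few components
    [~, original_idx] = sort(abs(eigenvectors(:, k)), 'descend');
    % original variables with the largest absolute weights on component k
    fprintf('Component %d: top variables %s\n', k, mat2str(original_idx(1:5)'));
end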

Upvotes: 2
