Reputation: 103

How to select the top 100 features (a subset) that are most relevant after PCA?

I performed PCA on a 63*2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63*2308 and the coefficient matrix is 2308*2308.

How do I extract the column names of the 100 most important features so that I can perform regression on them?

Upvotes: 3

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77454

Be careful!

With just 63 observations and 2308 variables, your PCA result will be meaningless because the problem is underdetermined. As a rule of thumb, you should have at least three times as many observations as dimensions.

With 63 observations, the centered data can span at most a 62-dimensional subspace!
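The 62-dimension limit can be checked numerically. The following is a minimal sketch on random stand-in data (not the asker's matrix); the variable names are made up for illustration. It centers the columns as PCA does and counts the singular values that are not numerically zero.

```matlab
% Synthetic stand-in for the 63-by-2308 data (random values, illustration only)
X = randn(63, 2308);

% Center each column, as PCA does before the decomposition
Xc = bsxfun(@minus, X, mean(X, 1));

% Singular values of the centered data; squared and scaled by (n - 1)
% they are the PCA eigenvalues
s = svd(Xc);
eigenvalues = s.^2 / (size(X, 1) - 1);

% Count the components whose variance is not numerically zero
tol = max(size(Xc)) * eps(max(s));
num_nonzero = sum(s > tol);
fprintf('Components with nonzero variance: %d\n', num_nonzero);
```

Centering removes one degree of freedom, so at most 63 - 1 = 62 components carry any variance; every remaining eigenvalue is zero up to rounding.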

Upvotes: 0

Alan

Reputation: 3417

PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1*2308), often referred to as lambda. You might need to use a different PCA function in MATLAB to get them.

The eigenvalues indicate how much of the variance in your data each eigenvector explains. A simple method for selecting features would be to select the 100 principal components (the transformed features) with the highest eigenvalues. This gives you a set of features that explains most of the variance in the data.

If you need to justify your approach for a write-up, you can calculate the amount of variance explained per eigenvector and cut off at, for example, 95% variance explained.
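A minimal sketch of that cutoff, assuming the eigenvalues are already available as a vector. The numbers below are made up for illustration; in practice they would be the third output of princomp.

```matlab
% Hypothetical eigenvalues standing in for the third output of princomp
eigenvalues = [5.0; 3.0; 1.0; 0.6; 0.3; 0.1];

% Make sure they are in descending order
eigenvalues = sort(eigenvalues, 'descend');

% Fraction of total variance explained by each component, and the running total
explained = eigenvalues / sum(eigenvalues);
cumulative = cumsum(explained);

% Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95;
num_components = find(cumulative >= threshold, 1, 'first');
fprintf('Keep %d components\n', num_components);
```

With these made-up values the cumulative shares are 0.50, 0.80, 0.90, 0.96, 0.99 and 1.00, so the first four components are kept.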

Bear in mind that selecting based solely on eigenvalue might not give you the set of features most important to your regression, so if you don't get the performance you expect, you might want to try a different feature selection method such as recursive feature elimination. I would suggest using Google Scholar to find a couple of papers doing something similar and see what methods they use.

A quick MATLAB example of taking the top 100 principal components using PCA:

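% princomp returns the eigenvectors, the projected data (scores) and the eigenvalues
% (princomp has been superseded by pca in newer MATLAB releases)
[eigenvectors, projected_data, eigenvalues] = princomp(X);
[~, feature_idx] = sort(eigenvalues, 'descend');   % order components by explained variance
selected_projected_data = projected_data(:, feature_idx(1:100));   % keep the first 100 components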
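Note that the example above selects principal components, each of which mixes all of the original columns, so it does not directly yield column names. One heuristic for getting back to original columns (an assumption on my part, not something PCA prescribes) is to score each column by its eigenvalue-weighted squared loadings over the retained components. A minimal sketch on synthetic data with hypothetical column names:

```matlab
% Synthetic stand-in data and hypothetical column names (illustration only)
X = randn(63, 20);
X(:, 7) = 10 * X(:, 7);   % give one column a much larger variance
names = arrayfun(@(k) sprintf('var%d', k), 1:size(X, 2), 'UniformOutput', false);

% PCA via SVD of the centered data: the columns of V are the eigenvectors
% (the coefficient matrix), ordered by decreasing eigenvalue
Xc = bsxfun(@minus, X, mean(X, 1));
[~, S, V] = svd(Xc, 'econ');
eigenvalues = diag(S).^2 / (size(X, 1) - 1);

% Score each original column by its squared loadings on the leading
% components, weighted by the eigenvalue of each component
num_components = 5;
scores = (V(:, 1:num_components).^2) * eigenvalues(1:num_components);

% Names of the highest-scoring original columns
[~, order] = sort(scores, 'descend');
top_k = 3;
top_names = names(order(1:top_k));
disp(top_names);
```

Here the inflated column var7 should come out on top because it dominates the first component. This ranks columns by how much variance they contribute, which still says nothing about their relevance to the regression target.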

Upvotes: 4

fpe

Reputation: 2750

Have you tried the following?

B = sort(your_matrix,2,'descend');
C = B(:,1:100);

Upvotes: 0
