user8751092

Principal component analysis in MATLAB?

I have a training set X_Training of size 122 x 125937.

From my limited understanding, PCA is useful when you want to reduce the number of features, meaning I should reduce 122 to a smaller number.

But when I run this in MATLAB:

X_new = pca(X_Training)

I get a matrix of size 125973x121. I am really confused, because this changes not only the number of features but also the sample size. This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network.

Any help? Did I misinterpret the results? I only want to reduce the number of features.

Upvotes: 1

Views: 2148

Answers (1)

rayryeng

Reputation: 104493

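Firstly, the documentation of the pca function is useful: https://www.mathworks.com/help/stats/pca.html. It states that the rows of the input are the samples and the columns are the features. Your matrix has the features along the rows, so pca treated it as 122 samples with 125937 features each. On top of that, the first output of pca is the matrix of principal component coefficients, not the transformed data, which is why the result has one row per column of your input. This means you need to transpose your matrix first.

Secondly, you need to decide on the number of dimensions to reduce to yourself; pca does not choose it for you and returns all components. Therefore, in addition to the principal component coefficients, you also need the scores, which are your data already projected onto the principal components. Once you have these, keeping only the first few columns of the scores gives you the data in the reduced space.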

In other words:

n_components = 10; % Change to however many dimensions you want to keep.
[coeff, score] = pca(X_Training.'); % Transpose so that rows are samples.
X_reduce = score(:, 1:n_components); % Keep the first n_components scores.

X_reduce is the dimensionality-reduced feature set: each row is a training example and each column is one of the reduced features. Notice that the number of training examples does not change, as we expect. If you want the features along the rows instead of the columns after the reduction, transpose this output matrix before you proceed.
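
To make the shapes concrete, here is a minimal sketch on synthetic data. X_demo is a made-up stand-in for your training matrix (features along the rows, samples along the columns), and the sizes are chosen only for illustration. It assumes the Statistics and Machine Learning Toolbox is available for pca.

```matlab
% Hypothetical stand-in for X_Training: 5 features (rows) x 200 samples (columns).
rng(0);                       % Fixed seed so the run is repeatable.
X_demo = randn(5, 200);
n_components = 2;             % Illustrative choice, not a recommendation.

% pca expects samples along the rows, so transpose before calling it.
[coeff, score] = pca(X_demo.');

X_reduce = score(:, 1:n_components);  % 200 x 2: one row per sample.
X_reduce_rows = X_reduce.';           % 2 x 200: features back along the rows.

disp(size(coeff));          % Size of the coefficient matrix (5 x 5 here).
disp(size(X_reduce));       % Size of the reduced data (200 x 2 here).
disp(size(X_reduce_rows));  % Size after transposing back (2 x 200 here).
```

The number of samples (200) is untouched, so a target vector with one entry per sample still lines up with the rows of X_reduce.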

If you want to determine the number of features to reduce to automatically, one method is to look at the percentage of variance explained by each principal component, then accumulate these values from the first component onwards until the total reaches some threshold. 95% is a common choice.

Therefore, you need to provide additional output variables to capture these:

[coeff, score, latent, tsquared, explained, mu] = pca(X_Training.');

I'll let you go through the documentation to understand the other variables, but the one you want here is explained, which holds the percentage of the total variance explained by each principal component. What you should do is find the first component at which the cumulative variance explained reaches 95%:

[~,n_components] = max(cumsum(explained) >= 95);
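
In case the max trick looks opaque: max on a logical vector returns the position of the first true entry, so it picks the first component at which the running total reaches the threshold. A small sketch with made-up percentages (explained_demo is hypothetical, not output from your data) shows this next to the equivalent find call:

```matlab
% Hypothetical variance-explained percentages; they sum to 100.
explained_demo = [60; 25; 8; 4; 3];

% Running total is [60; 85; 93; 97; 100], so the mask is [0; 0; 0; 1; 1].
reached = cumsum(explained_demo) >= 95;

[~, n_from_max] = max(reached);   % Index of the first true entry.
n_from_find = find(reached, 1);   % Same index, stated more directly.

disp([n_from_max, n_from_find]);
```

The two differ only when the threshold is never reached: max then returns 1, whereas find returns an empty result, which is easier to notice.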

Finally, if you want to see how well the original features can be reconstructed from the reduced ones, you need to project the reduced data back into the original feature space:

X_reconstruct = bsxfun(@plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);

mu holds the mean of each feature as a row vector. You therefore need to add this vector to every example, which requires broadcasting, and that's why bsxfun is used. If you're using MATLAB R2016b or later, the addition operator does this implicitly:

X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;
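
To put a number on how well the reconstruction performs, one option is the relative Frobenius-norm error against the centered data. This is only a sketch on synthetic data: X_demo, rel_err and the choice of three components are my own illustrative assumptions.

```matlab
% Hypothetical data with samples already along the rows: 200 samples x 5 features.
rng(0);
X_demo = randn(200, 5);
[coeff, score, ~, ~, explained, mu] = pca(X_demo);

n_components = 3;  % Illustrative choice.

% Reconstruct from the first n_components (implicit expansion adds mu to every row;
% on older releases, wrap the product in bsxfun(@plus, ..., mu) instead).
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;

% Relative reconstruction error, measured against the mean-centered data.
rel_err = norm(X_demo - X_reconstruct, 'fro') / norm(X_demo - mu, 'fro');

% Using every component should recover the data up to rounding error.
X_full = score * coeff.' + mu;
max_abs_err = max(abs(X_demo(:) - X_full(:)));

fprintf('Relative error with %d components: %.4f\n', n_components, rel_err);
```

The squared relative error equals the fraction of variance that the kept components do not explain, that is 1 - sum(explained(1:n_components))/100, so it ties back to the 95% threshold above.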

Upvotes: 3
