Abhishek Bhatia
Abhishek Bhatia

Reputation: 9806

Explained variance calculation

My questions are specific to https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.

  1. I don't understand why you square eigenvalues https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/decomposition/pca.py#L444 here?
  2. Also, explained_variance is not computed for new transformed data other than original data used to compute eigen-vectors. Is that not normally done?
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X) 
pca.transform(Y)

In this case, won't you separately calculate explained variance for data Y as well. For that purpose, I think we would have to use point 3 instead of using eigen-values.

  1. Explained variance can be also computed by taking the variance of each axis in the transformed space and dividing by the total variance. Any reason that is not done here?

Upvotes: 1

Views: 901

Answers (1)

seralouk
seralouk

Reputation: 33147

Answers to your questions:

1) The square roots of the eigenvalues of the scatter matrix (e.g. XX.T) are the singular values of X (see here: https://math.stackexchange.com/a/3871/536826). So you square them. Important: the initial matrix X should be centered (data has been preprocessed to have zero mean) in order for the above to hold.

2) Yes this is the way to go. explained_variance is computed based on the singular values. See point 1.

3) It's the same but in the case you describe you HAVE to project the data and then do additional computations. No need for that if you just compute it using the eigenvalues / singular values (see point 1 again for the connection between these two).

Finally, keep in mind that not everyone really wants to project the data. Someone can only get the eigenvalues and then immediately estimate the explained variance WITHOUT projecting the data. So that's the best gold standard way to do it.

EDIT 1:

Answer to edited Point 2

No. PCA is an unsupervised method. It only transforms the X data not the Y (labels).

Again, the explained variance can be computed fast, easily, and with half line of code using the eigenvalues/singular values OR as you said using the projected data e.g. estimating the covariance of the projected data, then variances of PCs will be in the diagonal.

Upvotes: 2

Related Questions