D. Rao

Reputation: 523

Is there a way to compute the explained variance of PCA on a test set?

I want to see how well PCA worked with my data.

I applied PCA to a training set and used the fitted pca object to transform a test set. The pca object has an attribute pca.explained_variance_ratio_ that gives the percentage of variance explained by each selected component, but only for the training set. After applying the transform, I want to see how well it worked on the test set. I tried inverse_transform(), which returns an approximation of the original values, but I have no way to compare how well the reconstruction works on the training set versus the test set.

pca = PCA(0.99)
pca.fit(train_df)
transformed_test = pca.transform(test_df)
inverse_test = pca.inverse_transform(transformed_test)
npt.assert_almost_equal(test_arr, inverse_test, decimal=2)

This returns:

Arrays are not almost equal to 2 decimals

Is there something like pca.explained_variance_ratio_ after transform()?

Upvotes: 4

Views: 1992

Answers (1)

TomDLT

Reputation: 4485

Variance explained by each component

You can compute it manually. If the components are orthogonal (which is the case in PCA), the share of the variance of X explained by component i is 1 - ||X_i - X||^2 / ||X - X_mean||^2, where X_i is the reconstruction of X from component i alone.

Hence the following example:

import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.randn(200, 5)
X_test = np.random.randn(100, 5)
model = PCA(n_components=5).fit(X_train)

def explained_variance(X):
    result = np.zeros(model.n_components)
    X_trans = model.transform(X)  # project once, reuse for every component
    for ii in range(model.n_components):
        # reconstruct X from component ii alone
        X_trans_ii = np.zeros_like(X_trans)
        X_trans_ii[:, ii] = X_trans[:, ii]
        X_approx_ii = model.inverse_transform(X_trans_ii)

        # share of the variance of X captured by component ii
        result[ii] = 1 - (np.linalg.norm(X_approx_ii - X) /
                          np.linalg.norm(X - model.mean_)) ** 2
    return result


print(model.explained_variance_ratio_)
print(explained_variance(X_train))
print(explained_variance(X_test))
# [0.25335711 0.23100201 0.2195476  0.15717412 0.13891916]
# [0.25335711 0.23100201 0.2195476  0.15717412 0.13891916]
# [0.17851083 0.199134   0.24198887 0.23286815 0.14749816]
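A useful sanity check on the formula (a small sketch with random data standing in for real features): because the components are orthogonal, the per-component ratios add up to the total explained variance, and when every component is kept the reconstruction is exact, so they sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.randn(200, 5)
model = PCA(n_components=5).fit(X_train)

# per-component ratios, computed the same way as explained_variance() above
X_trans = model.transform(X_train)
ratios = np.zeros(model.n_components_)
for ii in range(model.n_components_):
    keep = np.zeros_like(X_trans)
    keep[:, ii] = X_trans[:, ii]
    X_approx = model.inverse_transform(keep)
    ratios[ii] = 1 - (np.linalg.norm(X_approx - X_train) /
                      np.linalg.norm(X_train - model.mean_)) ** 2

# all 5 components of 5-dimensional data reconstruct X exactly,
# so the ratios sum to 1 (up to floating-point error)
print(ratios.sum())
```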

Total variance explained

Alternatively, if you only care about the total variance explained, you can use r2_score:

from sklearn.metrics import r2_score

model = PCA(n_components=2).fit(X_train)
print(model.explained_variance_ratio_.sum())
print(r2_score(X_train, model.inverse_transform(model.transform(X_train)),
               multioutput='variance_weighted'))
print(r2_score(X_test, model.inverse_transform(model.transform(X_test)),
               multioutput='variance_weighted'))
# 0.46445451252373826
# 0.46445451252373815
# 0.4470229486590848
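To connect this back to the question's setup, here is a sketch using synthetic low-rank data in place of train_df/test_df: fit PCA(0.99) on the training set, then score the reconstruction of both sets with r2_score. A test score close to the training score means PCA generalized well.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
# synthetic data with 2-dimensional latent structure plus noise,
# standing in for the question's train_df / test_df
W = rng.randn(2, 10)
X_train = rng.randn(200, 2) @ W + 0.1 * rng.randn(200, 10)
X_test = rng.randn(100, 2) @ W + 0.1 * rng.randn(100, 10)

# keep enough components for 99% of the training variance, as in the question
pca = PCA(0.99).fit(X_train)

train_score = r2_score(X_train, pca.inverse_transform(pca.transform(X_train)),
                       multioutput='variance_weighted')
test_score = r2_score(X_test, pca.inverse_transform(pca.transform(X_test)),
                      multioutput='variance_weighted')
print(train_score, test_score)
```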

Upvotes: 4
