paulo

Reputation: 65

PCA with sklearn discrepancies

I am trying to apply PCA in a very specific context and ran into behavior that I cannot explain. As a test, I am running the following code on the data file that you can retrieve here: https://www.dropbox.com/s/vdnvxhmvbnssr34/test.npy?dl=0 (numpy array format).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
test = np.load('test.npy')
pca = PCA()                             # no n_components given: keep all components
X_proj = pca.fit_transform(test)        # project onto the basis of eigenvectors
proj = pca.inverse_transform(X_proj)    # reconstruct the input from the projection

My issue is the following: because I do not specify a number of components, the reconstruction should use all the computed components. I therefore expect my output proj to be the same as my input test. But a quick plot shows that this is not the case:

plt.figure()
plt.plot(test[0]-proj[0])
plt.show()

The plot shows some large discrepancies between the reconstruction and the input matrix.

Does anyone have an idea or explanation to help me understand why proj is different from test in my case?

Upvotes: 2

Views: 57

Answers (1)

Sumit Chaturvedi

Reputation: 348

I checked your test data and found the following:

mean = test.mean() # 1.9545972004854737e+24
std = test.std() # 9.610595443778275e+26

I interpret the standard deviation as representing, in some sense, the least count or the uncertainty in the values that are reported. By that I mean that if a numerical algorithm reports the answer to be a, then the real answer should lie in the interval [a - std, a + std]. This is because numerical algorithms are imprecise by their very nature. They depend on floating-point operations, which obviously can't represent real numbers in all their glory.
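To put a rough number on that imprecision, here is a minimal sketch, assuming the data is stored as float64 (I did not verify the dtype of test.npy). It uses the std value reported above and asks NumPy for the gap between adjacent representable numbers at that magnitude. The variable names are mine and purely illustrative.

```python
import numpy as np

# Standard deviation of the test data, as reported above.
scale = 9.610595443778275e+26

# Relative precision of float64 (about 16 significant decimal digits).
eps = np.finfo(np.float64).eps

# Absolute gap between `scale` and the next representable float64.
ulp = np.spacing(scale)

print(f"machine epsilon: {eps:.3e}")
print(f"spacing of float64 values near {scale:.3e}: {ulp:.3e}")
```

At this magnitude a single rounding step is already a very large number in absolute terms, so an absolute difference that looks huge on a plot can still be tiny relative to the data.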

So if I plot:

plt.plot((test[0]-proj[0])/std)
plt.show()

I get the following plot which seems more reasonable.

[Plot: (test[0] - proj[0]) / std]

You may be interested in plotting relative errors as well. Alternatively, you can normalize your data to have zero mean and unit variance, after which the PCA results should be more accurate.
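Both suggestions could be sketched as follows. This is only an illustration under assumptions: it uses synthetic random data of a similar magnitude as a stand-in for test.npy (which is not reproduced here), and it uses StandardScaler from sklearn.preprocessing as one possible way to normalize; other ways of scaling would work too.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for test.npy (assumption: values of order 1e27).
rng = np.random.default_rng(0)
data = rng.normal(scale=1e27, size=(50, 20))

# Round trip through PCA with all components kept.
pca = PCA()
recon = pca.inverse_transform(pca.fit_transform(data))

# Absolute error looks large, relative error does not.
abs_err = np.abs(data - recon).max()
rel_err = abs_err / np.abs(data).max()
print(f"max absolute error: {abs_err:.3e}")
print(f"max relative error: {rel_err:.3e}")

# Same round trip after scaling to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data)
pca_s = PCA()
recon_s = pca_s.inverse_transform(pca_s.fit_transform(scaled))
abs_err_scaled = np.abs(scaled - recon_s).max()
print(f"max absolute error after scaling: {abs_err_scaled:.3e}")
```

With the real data, np.allclose(test, proj, rtol=1e-9, atol=0) would be a quick relative check in place of a plot; the tolerance there is an arbitrary choice of mine.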

Upvotes: 2
