Reputation: 5301
I was trying to check my implementation of PCA to see if I understood it, so I tried to do PCA with 12 components on the MNIST data set (which I got via the TensorFlow interface, which normalized it for me). I obtained the principal components given by sklearn and then made reconstructions as follows:
import numpy as np
from numpy import linalg as LA
from sklearn.decomposition import PCA

pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train)
# do manual PCA
U = pca.components_
X_my_reconstruct = np.dot(U.T, np.dot(U, X_train.T)).T
Then I used the reconstruction interface given by sklearn to try to reconstruct as follows:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train)
X_reconstruct = pca.inverse_transform(X_pca)
and then checked the error as follows (since the rows are data points and the columns are features):
print('X_recon - X_my_reconstruct', (1.0 / X_my_reconstruct.shape[0]) * LA.norm(X_my_reconstruct - X_reconstruct)**2)
# X_recon - X_my_reconstruct 1.47252586279
As you can see, the error is non-zero and actually quite noticeable. Why is that? How does their reconstruction differ from mine?
Upvotes: 3
Views: 2864
Reputation: 74172
I see a couple of issues:

The dot product should be X_pca.dot(pca.components_). PCA factorizes your X_train matrix using SVD:

X_train = U·S·Vᵀ

Here, pca.components_ corresponds to Vᵀ (a (k, n_features) matrix), not U (an (n_datapoints, k) matrix).
The sklearn implementation of PCA is quite readable, and can be found here. I also wrote a pure numpy example in this previous answer.
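A minimal sketch of the Vᵀ correspondence, on synthetic data (the random matrix X here is just a stand-in for X_train): each row of pca.components_ matches the corresponding row of Vᵀ from the SVD of the centered data, up to an arbitrary per-component sign flip.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
k = 5

pca = PCA(n_components=k).fit(X)

# PCA centers the data internally before taking the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each principal component equals a row of V^T up to sign
for i in range(k):
    sign = np.sign(np.dot(pca.components_[i], Vt[i]))
    assert np.allclose(pca.components_[i], sign * Vt[i])
```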
Did you center X_train by subtracting the mean of each column before fitting? The PCA class automatically centers your data and stores the original mean vector in its .mean_ attribute. If the mean of your input features is nonzero, then you need to add it back to your reconstruction, i.e. my_reconstruct += pca.mean_.
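Putting both fixes together, here is a sketch on synthetic data (a random matrix with a deliberately nonzero mean, standing in for X_train): project back with pca.components_ and re-add pca.mean_, and the result matches pca.inverse_transform.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data with a nonzero mean, so centering matters
rng = np.random.RandomState(0)
X_train = rng.randn(200, 30) + 5.0
k = 12

pca = PCA(n_components=k).fit(X_train)
X_pca = pca.transform(X_train)

# Corrected manual reconstruction: X_pca · V^T, then re-add the stored mean
my_reconstruct = X_pca.dot(pca.components_) + pca.mean_
X_reconstruct = pca.inverse_transform(X_pca)

assert np.allclose(my_reconstruct, X_reconstruct)
```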
Upvotes: 2