Reputation: 8054
I'm trying to use sklearn's PCA to reduce my data to 2 dimensions. However, the result I get from fit_transform() does not match the result of multiplying the components_ attribute by my input data.
Why don't these match? Which result is correct?
import numpy as np
from sklearn.decomposition import PCA

def test_pca_fit_transform(self):
    input_data = np.matrix([[11,4,9,3,2,2], [7,2,8,2,0,2], [3,1,2,5,2,9]])
    # each column of input data is an observation, each row is a dimension
    # method 1
    pca = PCA(n_components=2)
    data2d = pca.fit_transform(input_data.T)
    # method 2
    component_matrix = np.matrix(pca.components_)
    data2d_mult = (component_matrix * input_data).T
    np.testing.assert_almost_equal(data2d, data2d_mult)
    # FAILS!!!
Upvotes: 3
Views: 1788
Reputation: 15889
The only step you are missing (which sklearn handles internally) is the data centering. PCA requires centered data, so one of the first lines of sklearn's PCA fit method is:
X -= X.mean(axis=0)
This subtracts each column's mean, so every feature is centered across the observations (rows).
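As a small sketch of what that centering step does, here is the question's data transposed so that rows are observations. The variable names are illustrative and are not sklearn's internals:

```python
import numpy as np

# Rows are observations, columns are features (the question's data, transposed).
X = np.array([[11, 4, 9, 3, 2, 2],
              [7, 2, 8, 2, 0, 2],
              [3, 1, 2, 5, 2, 9]], dtype=float).T

col_means = X.mean(axis=0)   # one mean per feature, shape (3,)
X_centered = X - col_means   # broadcasting subtracts it from every row

# After centering, every column averages to (numerically) zero.
print(np.allclose(X_centered.mean(axis=0), 0.0))
```

Note that the in-place form (`X -= ...`) needs a float array; on an integer array NumPy refuses to cast the float result back, which is why this sketch builds a new array instead.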
To get the same result as sklearn (which is the correct one), you just need to center your data, either before fit or before your method2.
Here is a working example:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[11,4,9,3,2,2], [7,2,8,2,0,2], [3,1,2,5,2,9]])
X = X.T.copy()  # rows are now observations, columns are features
# PCA
pca = PCA(n_components=2)
data = pca.fit_transform(X)
# Your method 2
data2 = X.dot(pca.components_.T)
# Centering the data before method 2
data3 = X - X.mean(axis=0)
data3 = data3.dot(pca.components_.T)
# Compare
print(np.allclose(data, data2))  # prints False
print(np.allclose(data, data3))  # prints True
Note that I use .dot on standard numpy arrays instead of * on a numpy matrix, as I prefer to avoid matrix whenever possible, but the result is the same.
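A related sketch, assuming the default `whiten=False`: the fitted estimator stores the training mean in `pca.mean_`, so the manual projection can reuse it rather than recomputing the mean. This matters for new samples, which should be centered with the training mean. The `X_new` point below is made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same data as above: 6 observations (rows), 3 features (columns).
X = np.array([[11, 4, 9, 3, 2, 2],
              [7, 2, 8, 2, 0, 2],
              [3, 1, 2, 5, 2, 9]], dtype=float).T

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Manual projection: center with the stored training mean, then project.
manual = (X - pca.mean_) @ pca.components_.T
matches_fit = np.allclose(scores, manual)

# Hypothetical new observation, centered with the *training* mean.
X_new = np.array([[5.0, 4.0, 3.0]])
manual_new = (X_new - pca.mean_) @ pca.components_.T
matches_new = np.allclose(pca.transform(X_new), manual_new)

print(matches_fit, matches_new)
```

Both comparisons should agree, since this is the same center-then-project computation described above.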
Upvotes: 7