lizardfireman

Reputation: 369

sklearn PCA.transform gives different results for different trials

I am doing some PCA using sklearn.decomposition.PCA. I found that if the input matrix X is big, two different PCA instances give different results from PCA.transform. For example, when X is a 100x200 matrix there is no problem, but when X is a 1000x200 or a 100x2000 matrix, the results of the two PCA instances differ. I am not sure what causes this: I assumed there were no random elements in sklearn's PCA solver? I am using sklearn version 0.18.1 with Python 2.7.

The script below illustrates the issue.

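import numpy as np
from sklearn.decomposition import PCA

# With 100x200 the two results match; try 1000x200 or 100x2000 to see the mismatch.
n_sample, n_feature = 100, 200
X = np.random.rand(n_sample, n_feature)

# First PCA instance: fit and project X onto 10 components
pca_1 = PCA(n_components=10)
pca_1.fit(X)
X_transformed_1 = pca_1.transform(X)

# Second, independent PCA instance fitted on the same X
pca_2 = PCA(n_components=10)
pca_2.fit(X)
X_transformed_2 = pca_2.transform(X)

# Number of exactly equal entries, then the mean squared difference
print(np.sum(X_transformed_1 == X_transformed_2))
print(np.mean((X_transformed_1 - X_transformed_2) ** 2))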

Upvotes: 4

Views: 3874

Answers (2)

Areza

Reputation: 6080

I had a similar problem: even with the same trial number, I was getting different results on different machines. Setting svd_solver to 'arpack' solved the problem.
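A minimal sketch of that workaround, assuming a 1000x200 random matrix like the one in the question (the seed, the variable names and the tolerance are illustrative and not from the original post). With the solver set explicitly, the two instances are expected to agree up to floating-point tolerance on one machine; whether this also holds across machines is not checked here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: a fixed seed so that X itself is reproducible
rng = np.random.RandomState(0)
X = rng.rand(1000, 200)

# Select the ARPACK solver explicitly instead of relying on 'auto'
pca_a = PCA(n_components=10, svd_solver='arpack')
pca_b = PCA(n_components=10, svd_solver='arpack')

Xa = pca_a.fit_transform(X)
Xb = pca_b.fit_transform(X)

# Compare with a tolerance rather than with ==, since floating-point
# results are not guaranteed to match bit for bit
print(np.allclose(Xa, Xb, atol=1e-6))
```

Comparing with np.allclose rather than == is a deliberate choice: exact equality of floating-point output is a stricter requirement than most downstream uses need.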

Upvotes: 1

Vivek Kumar

Reputation: 36599

PCA has an svd_solver parameter, which defaults to "auto". Depending on the size of the input data, it chooses the most efficient solver.

In your case, when the size is larger than 500, it will choose the 'randomized' solver, which is why the two instances disagree.

svd_solver : string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}

auto :

the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

To control how the randomized solver behaves, you can set the random_state parameter in PCA, which controls the random number generator.

Try using:

pca_1 = PCA(n_components=10, random_state=SOME_INT)
pca_2 = PCA(n_components=10, random_state=SOME_INT)
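As a rough check of this suggestion, the sketch below fits PCA several times on a 1000x200 matrix with the randomized solver selected explicitly, so the outcome does not depend on the 'auto' policy of a particular sklearn version. The helper name project and the seeds 42 and 7 are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data, large enough that 'auto' would not pick the small-problem path
rng = np.random.RandomState(0)
X = rng.rand(1000, 200)

def project(seed):
    """Fit a fresh PCA with the given random_state and return the projection of X."""
    pca = PCA(n_components=10, svd_solver='randomized', random_state=seed)
    return pca.fit(X).transform(X)

same_1 = project(42)
same_2 = project(42)   # same seed as same_1
other = project(7)     # different seed

# Same seed: the two projections should agree
print(np.allclose(same_1, same_2))

# Different seed: mean squared difference between the projections
print(np.mean((same_1 - other) ** 2))
```

Any fixed integer works as the seed; what matters is that both instances receive the same value.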

Upvotes: 8
