Reputation: 86188
I am getting different results when running RandomizedPCA
on sparse and dense versions of the same matrix:
import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import RandomizedPCA
x = np.matrix([[1,2,3,2,0,0,0,0],
               [2,3,1,0,0,0,0,3],
               [1,0,0,0,2,3,2,0],
               [3,0,0,0,4,5,6,0],
               [0,0,4,0,0,5,6,7],
               [0,6,4,5,6,0,0,0],
               [7,0,5,0,7,9,0,0]])
csr_x = scsp.csr_matrix(x)
s_pca = RandomizedPCA(n_components=2)
s_pca_scores = s_pca.fit_transform(csr_x)
s_pca_weights = s_pca.explained_variance_ratio_
d_pca = RandomizedPCA(n_components=2)
d_pca_scores = d_pca.fit_transform(x)
d_pca_weights = d_pca.explained_variance_ratio_
print 'sparse matrix scores {}'.format(s_pca_scores)
print 'dense matrix scores {}'.format(d_pca_scores)
print 'sparse matrix weights {}'.format(s_pca_weights)
print 'dense matrix weights {}'.format(d_pca_weights)
Result:
sparse matrix scores [[ 1.90912166 2.37266113]
[ 1.98826835 0.67329466]
[ 3.71153199 -1.00492408]
[ 7.76361811 -2.60901625]
[ 7.39263662 -5.8950472 ]
[ 5.58268666 7.97259172]
[ 13.19312194 1.30282165]]
dense matrix scores [[-4.23432815 0.43110596]
[-3.87576857 -1.36999888]
[-0.05168291 -1.02612363]
[ 3.66039297 -1.38544473]
[ 1.48948352 -7.0723618 ]
[-4.97601287 5.49128164]
[ 7.98791603 4.93154146]]
sparse matrix weights [ 0.74988508 0.25011492]
dense matrix weights [ 0.55596761 0.44403239]
The dense version gives the same results as normal PCA, but what is going on when the matrix is sparse? Why are the results different?
Upvotes: 3
Views: 1852
Reputation: 40159
In the case of sparse data, RandomizedPCA does not center the data (mean removal), because centering would densify the matrix and could blow up the memory usage. That probably explains what you observe.
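One way to check this (a sketch, not part of the original code; the array and the variable names below are mine) is to compare plain SVD of the raw data with SVD of the mean-centered data: the first should match your "sparse" scores and the second your "dense" scores, up to sign flips and the randomized approximation error.

import numpy as np

x = np.array([[1,2,3,2,0,0,0,0],
              [2,3,1,0,0,0,0,3],
              [1,0,0,0,2,3,2,0],
              [3,0,0,0,4,5,6,0],
              [0,0,4,0,0,5,6,7],
              [0,6,4,5,6,0,0,0],
              [7,0,5,0,7,9,0,0]], dtype=float)

# SVD of the raw (uncentered) data -- what the sparse code path effectively does.
u, s, vt = np.linalg.svd(x, full_matrices=False)
scores_uncentered = u[:, :2] * s[:2]

# SVD of the mean-centered data -- what PCA / the dense code path does.
xc = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(xc, full_matrices=False)
scores_centered = u[:, :2] * s[:2]

print(scores_uncentered)  # close to the "sparse matrix scores" above
print(scores_centered)    # close to the "dense matrix scores" above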
I agree this "feature" is poorly documented. Please feel free to report an issue on GitHub to track it and improve the documentation.
Edit: we fixed that discrepancy in scikit-learn 0.15: RandomizedPCA is now deprecated for sparse data. Instead use TruncatedSVD, which does the same thing as PCA but without trying to center the data.
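A minimal sketch of the replacement on the data from the question (assuming scikit-learn >= 0.15, where TruncatedSVD is available, and reusing csr_x from above):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
svd_scores = svd.fit_transform(csr_x)        # works directly on the sparse matrix
svd_weights = svd.explained_variance_ratio_
print(svd_scores)
print(svd_weights)

Note this reproduces the uncentered behaviour; if you really need centered PCA on sparse data you have to densify and center it yourself first.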
Upvotes: 7