Akavall

Reputation: 86188

Different results when using sklearn RandomizedPCA with sparse and dense matrices

I am getting different results when using RandomizedPCA with sparse and dense matrices:

import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import RandomizedPCA

x = np.matrix([[1,2,3,2,0,0,0,0],
               [2,3,1,0,0,0,0,3],
               [1,0,0,0,2,3,2,0],
               [3,0,0,0,4,5,6,0],
               [0,0,4,0,0,5,6,7],
               [0,6,4,5,6,0,0,0],
               [7,0,5,0,7,9,0,0]])

csr_x = scsp.csr_matrix(x)

s_pca = RandomizedPCA(n_components=2)
s_pca_scores = s_pca.fit_transform(csr_x)
s_pca_weights = s_pca.explained_variance_ratio_

d_pca = RandomizedPCA(n_components=2)
# Use the dense estimator here (the original reused s_pca by mistake)
d_pca_scores = d_pca.fit_transform(x)
d_pca_weights = d_pca.explained_variance_ratio_

print 'sparse matrix scores {}'.format(s_pca_scores)
print 'dense matrix scores {}'.format(d_pca_scores)
print 'sparse matrix weights {}'.format(s_pca_weights)
print 'dense matrix weights {}'.format(d_pca_weights)

Result:

sparse matrix scores [[  1.90912166   2.37266113]
 [  1.98826835   0.67329466]
 [  3.71153199  -1.00492408]
 [  7.76361811  -2.60901625]
 [  7.39263662  -5.8950472 ]
 [  5.58268666   7.97259172]
 [ 13.19312194   1.30282165]]
dense matrix scores [[-4.23432815  0.43110596]
 [-3.87576857 -1.36999888]
 [-0.05168291 -1.02612363]
 [ 3.66039297 -1.38544473]
 [ 1.48948352 -7.0723618 ]
 [-4.97601287  5.49128164]
 [ 7.98791603  4.93154146]]
sparse matrix weights [ 0.74988508  0.25011492]
dense matrix weights [ 0.55596761  0.44403239]

The dense version gives the same results as regular PCA, but what is going on when the matrix is sparse? Why are the results different?

Upvotes: 3

Views: 1852

Answers (1)

ogrisel

Reputation: 40159

For sparse input, RandomizedPCA does not center the data (no mean removal), because subtracting the column means would turn the sparse matrix into a dense one and might blow up the memory usage. That probably explains what you observe.
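To illustrate the effect of skipping the centering step, here is a minimal sketch that uses only NumPy. It assumes an exact SVD (`np.linalg.svd`) as a stand-in for the randomized solver, and `svd_scores` is a hypothetical helper, not part of scikit-learn. If the explanation above is right, the uncentered scores should correspond to the sparse output in the question and the centered ones to the dense output, up to sign.

```python
import numpy as np

# Same data as in the question, stored as a float array.
x = np.array([[1, 2, 3, 2, 0, 0, 0, 0],
              [2, 3, 1, 0, 0, 0, 0, 3],
              [1, 0, 0, 0, 2, 3, 2, 0],
              [3, 0, 0, 0, 4, 5, 6, 0],
              [0, 0, 4, 0, 0, 5, 6, 7],
              [0, 6, 4, 5, 6, 0, 0, 0],
              [7, 0, 5, 0, 7, 9, 0, 0]], dtype=float)


def svd_scores(a, k=2):
    """Hypothetical helper: project `a` onto its first k right singular vectors."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k] * s[:k]


# No mean removal: what a decomposition of the raw matrix sees.
raw_scores = svd_scores(x)

# Mean removal first: what ordinary PCA does.
centered_scores = svd_scores(x - x.mean(axis=0))

# Centered scores sum to zero in every component; uncentered scores do not.
print(raw_scores.sum(axis=0))
print(centered_scores.sum(axis=0))
```

The column sums are a quick diagnostic: the dense scores in the question sum to roughly zero in each component, while the sparse scores are all positive in the first component, which is what an uncentered decomposition of non-negative data produces.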

I agree this "feature" is poorly documented. Please feel free to report an issue on GitHub to track it and improve the documentation.

Edit: we fixed that discrepancy in scikit-learn 0.15: RandomizedPCA is now deprecated for sparse data. Instead, use TruncatedSVD, which does the same as PCA without trying to center the data.
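As a rough sketch of that replacement, assuming a scikit-learn version that ships `TruncatedSVD` and the `svd_solver` parameter of `PCA`: the `arpack` solver is chosen here only to make the comparison deterministic, and the scores are compared by absolute value because the sign of each component is arbitrary.

```python
import numpy as np
import scipy.sparse as scsp
from sklearn.decomposition import PCA, TruncatedSVD

# Same data as in the question; ARPACK needs a floating-point dtype.
x = np.array([[1, 2, 3, 2, 0, 0, 0, 0],
              [2, 3, 1, 0, 0, 0, 0, 3],
              [1, 0, 0, 0, 2, 3, 2, 0],
              [3, 0, 0, 0, 4, 5, 6, 0],
              [0, 0, 4, 0, 0, 5, 6, 7],
              [0, 6, 4, 5, 6, 0, 0, 0],
              [7, 0, 5, 0, 7, 9, 0, 0]], dtype=float)
csr_x = scsp.csr_matrix(x)

# TruncatedSVD never centers, so sparse and dense input are treated alike.
sparse_scores = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(csr_x)
dense_scores = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(x)

# Centering by hand (this produces a dense array) and then applying
# TruncatedSVD gives the same projection as PCA.
centered = x - x.mean(axis=0)
svd_centered_scores = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(centered)
pca_scores = PCA(n_components=2, svd_solver="full").fit_transform(x)

print(np.allclose(np.abs(sparse_scores), np.abs(dense_scores), atol=1e-6))
print(np.allclose(np.abs(svd_centered_scores), np.abs(pca_scores), atol=1e-6))
```

Centering by hand is only practical when the dense array fits in memory, which is the trade-off the answer describes.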

Upvotes: 7
