Reputation: 3495
I'm trying to learn PCA through and through, but interestingly enough, when I use numpy and sklearn I get different covariance matrix results.
The numpy results match this explanatory text here, but the sklearn results differ from both.
Is there any reason why this is so?
import numpy as np
import pandas as pd
d = pd.read_csv("example.txt", header=None, sep=" ")
print(d)
0 1
0 0.69 0.49
1 -1.31 -1.21
2 0.39 0.99
3 0.09 0.29
4 1.29 1.09
5 0.49 0.79
6 0.19 -0.31
7 -0.81 -0.81
8 -0.31 -0.31
9 -0.71 -1.01
Numpy Results
print(np.cov(d, rowvar = 0))
[[ 0.61655556 0.61544444]
[ 0.61544444 0.71655556]]
sklearn Results
from sklearn.decomposition import PCA
clf = PCA()
clf.fit(d.values)
print(clf.get_covariance())
[[ 0.5549 0.5539]
[ 0.5539 0.6449]]
Upvotes: 1
Views: 768
Reputation: 35
I've encountered the same issue, and I think it returns different values because the covariance is calculated in a different way. According to the sklearn documentation, the get_covariance() method uses the noise variances to obtain the covariance matrix.
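For reference, the formula given in the sklearn documentation for get_covariance() is cov = components_.T * S**2 * components_ + sigma2 * eye(n_features). Below is a minimal NumPy-only sketch of that formula, assuming all components are kept so that the noise term sigma2 is zero. The function name reconstruct_cov and the ddof switch are illustrative and not part of sklearn.

```python
import numpy as np

# Data from the question (already mean-centered).
X = np.array([
    [0.69, 0.49], [-1.31, -1.21], [0.39, 0.99], [0.09, 0.29], [1.29, 1.09],
    [0.49, 0.79], [0.19, -0.31], [-0.81, -0.81], [-0.31, -0.31], [-0.71, -1.01],
])

def reconstruct_cov(X, ddof):
    """Rebuild the covariance from principal axes and explained variances.

    Hypothetical helper: mirrors the documented formula with sigma2 = 0,
    i.e. the case where no components are discarded.
    """
    n_samples, n_features = X.shape
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes; S holds the singular values.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S ** 2 / (n_samples - ddof)
    sigma2 = 0.0  # assumption: all components kept, so no noise variance
    return Vt.T @ np.diag(explained_variance) @ Vt + sigma2 * np.eye(n_features)

for ddof in (0, 1):
    same = np.allclose(reconstruct_cov(X, ddof), np.cov(X, rowvar=False, ddof=ddof))
    print(f"ddof={ddof}: matches np.cov with the same ddof -> {same}")
```

Under these assumptions the reconstruction reproduces the plain sample covariance for whichever divisor is used, so with all components kept the divisor is the only thing that changes the numbers.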
Upvotes: 0
Reputation: 6715
Because for np.cov, the documentation says:
Default normalization is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is 1, then normalization is by N.
Set bias=1 and the result is the same as PCA:
In [9]: np.cov(df, rowvar=0, bias=1)
Out[9]:
array([[ 0.5549, 0.5539],
[ 0.5539, 0.6449]])
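To make the two normalizations concrete, here is a small sketch that computes the covariance by hand from the question's data, once dividing by N and once by N - 1. The helper name cov_manual is made up for illustration; only NumPy is used.

```python
import numpy as np

# Data from the question: 10 observations, 2 variables.
X = np.array([
    [0.69, 0.49], [-1.31, -1.21], [0.39, 0.99], [0.09, 0.29], [1.29, 1.09],
    [0.49, 0.79], [0.19, -0.31], [-0.81, -0.81], [-0.31, -0.31], [-0.71, -1.01],
])

def cov_manual(X, ddof):
    """Covariance of the columns of X, dividing by (N - ddof)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each column
    return Xc.T @ Xc / (n - ddof)    # scatter matrix over the chosen divisor

cov_biased = cov_manual(X, ddof=0)    # divide by N, same as np.cov(..., bias=1)
cov_unbiased = cov_manual(X, ddof=1)  # divide by N - 1, the np.cov default

print(cov_biased)
print(cov_unbiased)

# The two differ only by the constant factor (N - 1) / N.
n = X.shape[0]
print(np.allclose(cov_biased, cov_unbiased * (n - 1) / n))
```

With N = 10 the factor is 0.9, which is exactly the ratio between the two matrices shown in the question (0.61655556 * 0.9 = 0.5549).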
Upvotes: 2