Problem description:

I'm trying to learn PCA thoroughly, but interestingly enough, when I use numpy and sklearn I get different covariance matrix results.

The numpy results match this explanatory text here, but the sklearn results differ from both.

Is there any reason why this is so?

import numpy as np
import pandas as pd

d = pd.read_csv("example.txt", header=None, sep=" ")
print(d)

      0     1
0  0.69  0.49
1 -1.31 -1.21
2  0.39  0.99
3  0.09  0.29
4  1.29  1.09
5  0.49  0.79
6  0.19 -0.31
7 -0.81 -0.81
8 -0.31 -0.31
9 -0.71 -1.01

Numpy Results

print(np.cov(d, rowvar=0))

[[ 0.61655556  0.61544444]
 [ 0.61544444  0.71655556]]

sklearn Results

from sklearn.decomposition import PCA

clf = PCA()
clf.fit(d.values)
print(clf.get_covariance())

[[ 0.5549  0.5539]
 [ 0.5539  0.6449]]

Answer:

Because for np.cov,

Default normalization is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is 1, then normalization is by N.
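Here N = 10, so the biased and unbiased estimates differ only by the scalar factor (N - 1)/N = 0.9; for example, 0.61655556 × 0.9 = 0.5549, which is exactly the discrepancy seen above.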

Set bias=1, and the result is the same as PCA:

In [9]: np.cov(d, rowvar=0, bias=1)
Out[9]:
array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])
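
To see where both sets of numbers come from, here is a minimal sketch (assuming the same example.txt data as in the question) that builds the two covariance estimates by hand from the centered data:

import numpy as np
import pandas as pd

d = pd.read_csv("example.txt", header=None, sep=" ")
X = d.values
N = X.shape[0]

# Center each column, then form the scatter matrix Xc^T Xc.
Xc = X - X.mean(axis=0)
scatter = Xc.T @ Xc

print(scatter / (N - 1))  # unbiased estimate: matches np.cov(d, rowvar=0)
print(scatter / N)        # biased estimate: matches the get_covariance() output above

Equivalently, np.cov also accepts a ddof argument; np.cov(d, rowvar=0, ddof=0) produces the same biased estimate as bias=1.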