danyellzz

Reputation: 91

Why do np.corrcoef(x) and df.corr() give different results?

Why are the NumPy correlation coefficient matrix from np.corrcoef(x) and the pandas correlation coefficient matrix from df.corr() different?

import numpy as np
import pandas as pd

x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)
print("matrix:")
print(x)
print()
print("df:")
print(x_df)
print()

print("np correlation matrix: ")
print(np.corrcoef(x))
print()
print("pd correlation matrix: ")

print(x_df.corr())
print()

This gives me the output:

matrix:
[[ 0  1  2]
 [ 2  1  0]
 [ 7  9 13]]

df:
   0  1   2
0  0  1   2
1  2  1   0
2  7  9  13

np correlation matrix: 
[[ 1.         -1.          0.98198051]
 [-1.          1.         -0.98198051]
 [ 0.98198051 -0.98198051  1.        ]]

pd correlation matrix: 
          0         1         2
0  1.000000  0.960769  0.911293
1  0.960769  1.000000  0.989743
2  0.911293  0.989743  1.000000

I'm guessing they are different types of correlation coefficients?

Upvotes: 9

Views: 8784

Answers (1)

Paul Brennan

Reputation: 2696

@AlexAlex is right: the two calls are computing correlation coefficients over different sets of numbers.

Think about it with a 2x3 matrix:

yx = np.array([[0, 2, 7], [1, 1, 9]])
np.corrcoef(yx)

gives

array([[1.        , 0.96076892],
       [0.96076892, 1.        ]])

and

x_df = pd.DataFrame(yx.T)
print(x_df)
x_df[0].corr(x_df[1])

gives

   0  1
0  0  1
1  2  1
2  7  9

0.9607689228305227

where the 0.9607... value matches the off-diagonal entry of the NumPy result.

Your original call computes the correlation between the rows rather than the columns: np.corrcoef treats each row as a variable by default, whereas df.corr() treats each column as a variable. You can make the NumPy version match by passing x.T or the argument rowvar=False.
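As a minimal sketch of that fix, using the array from the question (the variable names by_columns and by_transpose are just illustrative), both NumPy variants should agree with the pandas result:

```python
import numpy as np
import pandas as pd

x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)

# np.corrcoef treats each ROW as a variable by default (rowvar=True),
# while DataFrame.corr() treats each COLUMN as a variable.
by_columns = np.corrcoef(x, rowvar=False)  # columns as variables
by_transpose = np.corrcoef(x.T)            # same result via transposing

print(by_columns)
print(np.allclose(by_columns, x_df.corr().to_numpy()))
print(np.allclose(by_transpose, x_df.corr().to_numpy()))
```

Both comparisons are made with np.allclose rather than ==, since the two libraries may differ in the last few floating-point digits.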

Upvotes: 6
