Reputation: 91
Why are the NumPy correlation coefficient matrix and the pandas correlation coefficient matrix different when using np.corrcoef(x) and df.corr()?
import numpy as np
import pandas as pd

x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)
print("matrix:")
print(x)
print()
print("df:")
print(x_df)
print()
print("np correlation matrix: ")
print(np.corrcoef(x))
print()
print("pd correlation matrix: ")
print(x_df.corr())
print()
This gives me the output:
matrix:
[[ 0  1  2]
 [ 2  1  0]
 [ 7  9 13]]

df:
   0  1   2
0  0  1   2
1  2  1   0
2  7  9  13

np correlation matrix:
[[ 1.         -1.          0.98198051]
 [-1.          1.         -0.98198051]
 [ 0.98198051 -0.98198051  1.        ]]

pd correlation matrix:
          0         1         2
0  1.000000  0.960769  0.911293
1  0.960769  1.000000  0.989743
2  0.911293  0.989743  1.000000

I'm guessing they are different types of correlation coefficients?
Upvotes: 9
Views: 8784
Reputation: 2696
@AlexAlex is right: the two calls compute correlation coefficients over different sets of numbers.
Consider a 2x3 matrix:
x = np.array([[0, 2, 7], [1, 1, 9]])
np.corrcoef(x)
gives
array([[1.        , 0.96076892],
       [0.96076892, 1.        ]])
and
x_df = pd.DataFrame(x.T)
print(x_df)
print(x_df[0].corr(x_df[1]))
gives
   0  1
0  0  1
1  2  1
2  7  9
0.9607689228305227
where 0.9607... matches the off-diagonal entry of the NumPy result.
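Your original call, np.corrcoef(x), treats each row as a variable, so it computes the correlations between the rows, whereas df.corr() computes the correlations between the columns. You can make the NumPy version match by passing the transpose, np.corrcoef(x.T), or by using the argument rowvar=False.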
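As a quick check of the point above, here is a minimal sketch using the array from the question. The variable names (by_rows, by_cols, by_cols_t, pandas_corr) are mine, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Same data as in the question: a 3x3 array after the transpose.
x = np.array([[0, 2, 7], [1, 1, 9], [2, 0, 13]]).T
x_df = pd.DataFrame(x)

# Default (rowvar=True): each ROW of x is treated as a variable.
by_rows = np.corrcoef(x)

# rowvar=False: each COLUMN is treated as a variable, like DataFrame.corr().
by_cols = np.corrcoef(x, rowvar=False)

# Transposing the input has the same effect as rowvar=False.
by_cols_t = np.corrcoef(x.T)

# Pearson correlation between the DataFrame's columns, as a plain array.
pandas_corr = x_df.corr().to_numpy()

print(np.allclose(by_cols, pandas_corr))    # True
print(np.allclose(by_cols_t, pandas_corr))  # True
print(np.allclose(by_rows, pandas_corr))    # False
```

So both functions compute the same Pearson coefficient; they only differ in whether rows or columns are taken as the variables by default.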
Upvotes: 6