Intuition behind the correlation

Question

I'm following this tutorial online from Kaggle and I can't get my head around why .T changes the shape of the resulting matrix. Here is the part I am stuck on:

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

I'm basically troubleshooting the code and tried this:

 cm = np.corrcoef(df_train[cols].values)
 cm.shape

This returns a matrix with shape 1460x1460. But when I input:

 cm = np.corrcoef(df_train[cols].values.T)
 cm.shape

it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure it out.

yatu · Accepted Answer

The correlation matrix is a normalized version of the covariance matrix between all the "columns" (variables) of the dataframe. For instance, with only two variables, you'd end up with a matrix of the form:

Rx =  [[   1,    r_xy],
       [r_yx,       1]]

This can be an expensive computation, since it involves taking the dot product of each (mean-centered) column with every other column, resulting in one correlation coefficient for each pair of columns.
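As a rough illustration of a single entry r_xy, here is a minimal sketch that computes one coefficient as a normalized dot product of two mean-centered vectors and compares it with np.corrcoef. The data are synthetic, and the names x, y, r_xy and the seed are made up for the example; they do not come from the tutorial.

```python
import numpy as np

# Synthetic stand-ins for two columns with 1460 observations each (assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=1460)
y = 0.5 * x + rng.normal(size=1460)

# Center each vector, then normalize the dot product by the vector norms.
xc = x - x.mean()
yc = y - y.mean()
r_xy = xc.dot(yc) / np.sqrt(xc.dot(xc) * yc.dot(yc))

# np.corrcoef(x, y) returns a 2x2 matrix; the off-diagonal entry is r_xy.
r_numpy = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_xy, r_numpy))
```

Doing this for every pair of columns is what fills in the full matrix.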

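By default (rowvar=True), np.corrcoef treats each row of its input as a variable and each column as an observation. df_train[cols].values has shape (1460,10), so without the transpose you are correlating 1460 "variables" with 10 observations each, which gives a 1460x1460 matrix. In matrix notation, since you want to end up with a 10x10 matrix, the shapes have to be aligned as (10,1460)x(1460,10). Hence you need to transpose the 2D array so that it has shape (10,1460) when you pass it to np.corrcoef.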
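The effect of the transpose on the output shape can be checked with a small sketch. The array below is random data standing in for df_train[cols].values (an assumption, since the real dataframe is not available here). It also shows that passing rowvar=False, a documented parameter of np.corrcoef, should give the same result as transposing.

```python
import numpy as np

# Random stand-in for df_train[cols].values: 1460 rows (houses), 10 columns (features).
rng = np.random.default_rng(0)
data = rng.normal(size=(1460, 10))

by_rows = np.corrcoef(data)                    # each of the 1460 rows is a "variable"
by_cols = np.corrcoef(data.T)                  # each of the 10 columns is a variable
by_cols_alt = np.corrcoef(data, rowvar=False)  # same as above, without the transpose

print(by_rows.shape)                       # (1460, 1460)
print(by_cols.shape)                       # (10, 10)
print(np.allclose(by_cols, by_cols_alt))   # True
```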

You might find it easier to understand by playing around with it yourself and seeing how the Pearson correlation is actually computed:

import numpy as np

X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))

array([[1.        , 0.04400245],
       [0.04400245, 1.        ]])

Which is doing the same as:

mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape

# entry (i, j) is divided by n * std_i * std_j, which keeps the result symmetric
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*np.outer(std_X, std_X)))

array([[1.        , 0.04400245],
       [0.04400245, 1.        ]])

Note that, as mentioned, the result is a normalized dot product of the mean-centered X with itself, so with your data each (1,1460)x(1460,1) product gives a single number. So X here, just as in your example, has to be transposed so that the dimensions are correctly aligned.
