apang

Reputation: 103

Intuition behind the correlation

I'm following a tutorial on Kaggle and I can't get my head around why .T changes the shape of the resulting matrix. Here is the part I'm stuck on:

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

I'm basically troubleshooting the code, and I tried this:

 cm = np.corrcoef(df_train[cols].values)
 cm.shape

This returns a matrix with shape 1460x1460. But when I run:

 cm = np.corrcoef(df_train[cols].values.T)
 cm.shape

it returns a matrix with shape 10x10. Does anyone know why it does this? I can't figure it out.

Upvotes: 0

Views: 284

Answers (2)

yatu

Reputation: 88236

The correlation matrix is a normalized version of the covariance matrix between all the "columns" of the dataframe. For instance, with only two variables, you'd end up with a matrix of the form:

Rx =  [[   1,    r_xy],
       [r_yx,       1]]

This is quite an expensive computation, since it involves taking the (centered, normalized) dot product of each column with every other column, resulting in a correlation coefficient for each pair.

So in matrix notation, since you want to end up with a 10x10 matrix, the shapes have to be aligned correctly. In this case you want (10,1460)x(1460,10), which gives a (10,10) matrix. Hence you need to transpose the 2D array so that it has shape (10,1460) when you feed it to np.corrcoef.
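To make the shape argument concrete, here is a minimal sketch. The array `data` is a random stand-in for `df_train[cols].values` (an assumption, since the real dataset is not reproduced here); only the shapes matter.

```python
import numpy as np

# Hypothetical stand-in for df_train[cols].values:
# 1460 rows (observations) and 10 columns (variables).
rng = np.random.default_rng(0)
data = rng.normal(size=(1460, 10))

# Center each column, then multiply (10, 1460) by (1460, 10).
centered = data - data.mean(axis=0)
gram = centered.T @ centered

print(gram.shape)                  # (10, 10)
print(np.corrcoef(data.T).shape)   # (10, 10): 10 variables
print(np.corrcoef(data).shape)     # (1460, 1460): each row treated as a variable
```

The product of the centered columns is only the unnormalized part of the correlation, but it already has the 10x10 shape you are after.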

Though you might find it a little easier by playing around with it yourself and seeing how the actual Pearson correlation is computed:

import numpy as np

X = np.random.randint(0,10,(500,2))
print(np.corrcoef(X.T))

array([[1.        , 0.04400245],
       [0.04400245, 1.        ]])

Which is doing the same as:

mean_X = X.mean(axis=0)
std_X = X.std(axis=0)
n, _ = X.shape

# Normalize entry (i, j) by std_i * std_j, i.e. by the outer product of the stds
print((X.T-mean_X[:,None]).dot(X-mean_X)/(n*np.outer(std_X, std_X)))

array([[1.        , 0.04400245],
       [0.04400245, 1.        ]])

Note that, as mentioned, the result is a normalized dot product of X with itself, so each (1,1460)x(1460,1) product in your case yields a single number. So X here, just as in your example, has to be transposed so that the dimensions are correctly aligned.

Upvotes: 3

Bruno Mello

Reputation: 4618

From the NumPy documentation of corrcoef:

x : array_like
A 1-D or 2-D array containing multiple variables and observations. 
Each row of x represents a variable, and 
each column a single observation of all those variables. Also see rowvar below.

Note that each row represents a variable. In the first case you have 1460 rows and 10 columns, and in the second you have 10 rows and 1460 columns.

So when you transpose your NumPy array, you're basically changing from 1460 variables with 10 values each to 10 variables with 1460 values each.
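As an aside, the `rowvar` parameter mentioned in the quoted documentation offers a way to avoid the explicit transpose. The sketch below uses random data as a stand-in for `df_train[cols].values` (an assumption for illustration) and checks that both routes agree.

```python
import numpy as np

# Hypothetical stand-in for df_train[cols].values: 1460 observations, 10 variables.
rng = np.random.default_rng(1)
data = rng.normal(size=(1460, 10))

by_transpose = np.corrcoef(data.T)            # rows are variables after .T
by_rowvar = np.corrcoef(data, rowvar=False)   # tell NumPy that columns are variables

print(by_rowvar.shape)                        # (10, 10)
print(np.allclose(by_transpose, by_rowvar))   # True
```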

If you are working with pandas, you could just use the built-in .corr() method, which computes the correlation between columns.
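A small sketch of that pandas route follows. The DataFrame and its column names `a`, `b`, `c` are made up for illustration, and the data has no missing values. Under those assumptions, `.corr()` with its default Pearson method should match `np.corrcoef` on the transposed values.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for df_train[cols]; column names are invented.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

pandas_corr = df.corr()                  # correlation between columns, no transpose needed
numpy_corr = np.corrcoef(df.values.T)    # same computation via NumPy

print(pandas_corr.shape)                              # (3, 3)
print(np.allclose(pandas_corr.values, numpy_corr))    # True
```

A side benefit of the pandas version is that the result keeps the column labels, which can be passed straight to a heatmap.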

Upvotes: 1
