themachinist

Reputation: 1433

pd.corrwith on pandas dataframes with different column names

I would like to compute the Pearson r between x1 and each of the three columns of y efficiently.

It appears that DataFrame.corrwith() only computes correlations between columns that share exactly the same label, as in x.corrwith(y) in the session that follows.

This seems impractical, since computing correlations between differently named variables is presumably a common task.

In [1]: import pandas as pd; import numpy as np

In [2]: x = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])

In [3]: y = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])

In [4]: x1 = pd.DataFrame(x.iloc[:, 0])

In [5]: x.corrwith(y)
Out[5]:
A   -0.752631
B   -0.525705
C    0.516071
dtype: float64

In [6]: x1.corrwith(y)
Out[6]:
A   -0.752631
B         NaN
C         NaN
dtype: float64

Upvotes: 5

Views: 8163

Answers (2)

seth-p

Reputation: 131

You can accomplish what you want using DataFrame.corrwith(Series) rather than DataFrame.corrwith(DataFrame):

In [203]: x1 = x['A']

In [204]: y.corrwith(x1)
Out[204]:
A    0.347629
B   -0.480474
C   -0.729303
dtype: float64
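Because a Series has no column labels to align on, this also works when the two frames share no column names. A minimal self-contained sketch, assuming a recent pandas and NumPy; the seed and the column names D, E, F are made up for illustration and are not from the question:

```python
import numpy as np
import pandas as pd

# Illustrative data: the seed and the labels D/E/F are assumptions.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])
y = pd.DataFrame(rng.standard_normal((5, 3)), columns=["D", "E", "F"])

# Passing a Series: each column of y is correlated with x1,
# aligned on the row index only, so column labels need not match.
x1 = x["A"]
r = y.corrwith(x1)
print(r)
```

The result is indexed by the column labels of y, with one Pearson r per column.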

Alternatively, you can form the matrix of correlations between each column of x and each column of y as follows:

In [214]: pd.expanding_corr(x, y, pairwise=True).iloc[-1, :, :]
Out[214]:
          A         B         C
A  0.347629 -0.480474 -0.729303
B -0.334814  0.778019  0.654583
C -0.453273  0.212057  0.149544

Alas, DataFrame.corrwith() doesn't have a pairwise=True option.
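As far as I know, the top-level pd.expanding_corr function is not available in newer pandas releases. One way to build the same kind of cross-correlation matrix is to apply corrwith(Series) once per column of x. This is a sketch under that assumption, not necessarily the most efficient approach; the seed and the column names D, E, F are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: the seed and the labels D/E/F are assumptions.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])
y = pd.DataFrame(rng.standard_normal((5, 3)), columns=["D", "E", "F"])

# For each column of x, correlate it with every column of y.
# apply() stacks the per-column results side by side (y labels as rows),
# so transpose to get rows = columns of x, columns = columns of y.
cross = x.apply(lambda col: y.corrwith(col)).T
print(cross)
```

Each entry cross.loc[a, d] is the Pearson r between x[a] and y[d], laid out like the matrix above.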

Upvotes: 13

Primer

Reputation: 10302

You might do this (with np.random.seed(0)):

x1 = pd.DataFrame(x.iloc[:, 0].repeat(x.shape[1]).to_numpy().reshape(x.shape), columns=x.columns)
x1.corrwith(y)

to get this result:

A   -0.509
B    0.041
C   -0.732

Upvotes: 0
