Reputation: 1433
I would like to compute Pearson's r between x1 and each of the three columns in y, efficiently.
It appears that DataFrame.corrwith() only computes correlations between columns that carry the same label in both frames, as x and y do below.
This seems impractical, since I presume that correlating differently named variables is a common problem.
In [1]: import pandas as pd; import numpy as np
In [2]: x = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
In [3]: y = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
In [4]: x1 = pd.DataFrame(x.iloc[:, 0])
In [5]: x.corrwith(y)
Out[5]:
A -0.752631
B -0.525705
C 0.516071
dtype: float64
In [6]: x1.corrwith(y)
Out[6]:
A -0.752631
B NaN
C NaN
dtype: float64
Upvotes: 5
Views: 8163
Reputation: 131
You can accomplish what you want using DataFrame.corrwith(Series) rather than DataFrame.corrwith(DataFrame):
In [203]: x1 = x['A']
In [204]: y.corrwith(x1)
Out[204]:
A 0.347629
B -0.480474
C -0.729303
dtype: float64
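To make the difference concrete, here is a minimal self-contained sketch contrasting the two calls. The seed, the use of np.random.default_rng, and the names x1_frame, x1_series, by_frame and by_series are my own choices for illustration, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])
y = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

x1_frame = x[["A"]]   # single-column DataFrame: columns are matched by label
x1_series = x["A"]    # Series: correlated against every column of y

by_frame = x1_frame.corrwith(y)    # only "A" has a partner; "B" and "C" are NaN
by_series = y.corrwith(x1_series)  # one Pearson r per column of y

print(by_frame)
print(by_series)
```

If x1 already exists as a one-column DataFrame, x1.iloc[:, 0] turns it back into a Series before the call.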
Alternatively, you can form the matrix of correlations between each column of x and each column of y as follows:
In [214]: pd.expanding_corr(x, y, pairwise=True).iloc[-1, :, :]
Out[214]:
A B C
A 0.347629 -0.480474 -0.729303
B -0.334814 0.778019 0.654583
C -0.453273 0.212057 0.149544
Alas, DataFrame.corrwith() doesn't have a pairwise=True option.
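Since pd.expanding_corr is no longer available in recent pandas releases, one way to get the same kind of cross-correlation matrix is to apply corrwith column by column. This is only a sketch under my own assumptions: cross_corr is a hypothetical helper name, not a pandas function, and the random data is seeded arbitrarily, so the values differ from those shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])
y = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

def cross_corr(left, right):
    """Pearson r between every column of `left` and every column of `right`.

    Hypothetical helper: rows are labelled by left's columns,
    columns by right's columns.
    """
    # apply() passes each column of `left` as a Series to right.corrwith,
    # which returns one r per column of `right`; the transpose puts
    # left's columns on the rows.
    return left.apply(right.corrwith).T

pairwise = cross_corr(x, y)
print(pairwise)
```

Here pairwise.loc["A"] is the same Series that y.corrwith(x["A"]) returns. The helper calls corrwith once per column of left, so it is meant for a modest number of columns.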
Upvotes: 13
Reputation: 10302
You might do this (with np.random.seed(0)):
x1 = pd.DataFrame(np.repeat(x.iloc[:, [0]].to_numpy(), x.shape[1], axis=1), columns=x.columns)
x1.corrwith(y)
to get this result:
A -0.509
B 0.041
C -0.732
Upvotes: 0