Reputation: 4807
I have two data frames with 200 columns each. For illustration I am using only 3 columns here.
Dataframe df1 as:
A B C
1/4/2017 5 6 6
1/5/2017 5 2 1
1/6/2017 6 2 10
1/9/2017 1 9 10
1/10/2017 6 6 4
1/11/2017 6 1 1
1/12/2017 1 7 10
1/13/2017 8 9 6
Dataframe df2:
A D B
1/4/2017 8 10 2
1/5/2017 2 1 8
1/6/2017 6 6 6
1/9/2017 1 8 1
1/10/2017 10 6 2
1/11/2017 10 2 4
1/12/2017 5 4 10
1/13/2017 5 2 8
I want to calculate the following correlation matrix for corresponding columns of df1
and df2
:
A B
1/4/2017
1/5/2017
1/6/2017 0.19 -0.94
1/9/2017 0.79 -0.96
1/10/2017 0.90 -0.97
1/11/2017 1.00 -1.00
1/12/2017 1.00 0.42
1/13/2017 0.24 0.84
i.e. using trailing 3 day historical data for same columns of df1
and df2
, I need to find the correlation matrix.
so, I calculated corr([5, 5, 6], [8, 2, 6]) = 0.19
where [5,5,6]
is from df1['A']
and [8,2,6]
is from df2['A']
Since, I have 200 columns each I am finding it extremely cumbersome to run a for loop two times. First loop through columns and second using trailing 3 day lag data.
Upvotes: 2
Views: 120
Reputation: 294506
Option 1
I built a generator and wrapped it in pd.concat
def rolling_corrwith(d1, d2, window):
d1, d2 = d1.align(d2, 'inner')
for i in range(len(d1) - window + 1):
j = i + window
yield d1.iloc[i:j].corrwith(d2.iloc[i:j]).rename(d1.index[j-1])
pd.concat(list(rolling_corrwith(df1, df2, 3)), axis=1).T
A B
1/6/2017 0.188982 -0.944911
1/9/2017 0.785714 -0.960769
1/10/2017 0.896258 -0.968620
1/11/2017 1.000000 -0.998906
1/12/2017 1.000000 0.423415
1/13/2017 0.240192 0.838628
Option 2
Using numpy strides. I don't recommend this approach. But it's worth mentioning for those who are interested.
from numpy.lib.stride_tricks import as_strided as strided
def sprp(v, w):
s0, s1 = v.strides
n, m = v.shape
return strided(v, (n + 1 - w, w, m), (s0, s0, s1))
def rolling_corrwith2(d1, d2, window):
d1, d2 = d1.align(d2, 'inner')
s1 = sprp(d1.values, window)
s2 = sprp(d2.values, window)
m1 = s1.mean(1, keepdims=1)
m2 = s2.mean(1, keepdims=1)
z1 = s1.std(1)
z2 = s2.std(1)
c = ((s1 - m1) * (s2 - m2)).sum(1) / z1 / z2 / window
return pd.DataFrame(c, d1.index[window - 1:], d1.columns)
rolling_corrwith2(df1, df2, 3)
A B
1/6/2017 0.188982 -0.944911
1/9/2017 0.785714 -0.960769
1/10/2017 0.896258 -0.968620
1/11/2017 1.000000 -0.998906
1/12/2017 1.000000 0.423415
1/13/2017 0.240192 0.838628
Upvotes: 3
Reputation: 323366
Is this what you need ?
l=[]
id=df1.columns.intersection(df2.columns)
for x in id:
l.append(pd.rolling_corr(df1[x],df2[x],window=3))# notice you should change it to `l.append(df1[x].rolling(3).corr(df2[x]))`
pd.concat(l,axis=1)
Out[13]:
A B
1/4/2017 NaN NaN
1/5/2017 NaN NaN
1/6/2017 0.188982 -0.944911
1/9/2017 0.785714 -0.960769
1/10/2017 0.896258 -0.968620
1/11/2017 1.000000 -0.998906
1/12/2017 1.000000 0.423415
1/13/2017 0.240192 0.838628
Upvotes: 4