Reputation: 4807

Finding correlation for corresponding columns in dataframe

I have two data frames with 200 columns each. For illustration I am using only 3 columns here.

Dataframe df1 as:

            A   B   C
1/4/2017    5   6   6
1/5/2017    5   2   1
1/6/2017    6   2   10
1/9/2017    1   9   10
1/10/2017   6   6   4
1/11/2017   6   1   1
1/12/2017   1   7   10
1/13/2017   8   9   6

Dataframe df2:

            A   D   B
1/4/2017    8   10  2
1/5/2017    2   1   8
1/6/2017    6   6   6
1/9/2017    1   8   1
1/10/2017   10  6   2
1/11/2017   10  2   4
1/12/2017   5   4   10
1/13/2017   5   2   8

I want to calculate the following correlation matrix for corresponding columns of df1 and df2:

            A       B
1/4/2017        
1/5/2017        
1/6/2017    0.19    -0.94
1/9/2017    0.79    -0.96
1/10/2017   0.90    -0.97
1/11/2017   1.00    -1.00
1/12/2017   1.00    0.42
1/13/2017   0.24    0.84

i.e. using trailing 3 day historical data for same columns of df1 and df2, I need to find the correlation matrix.

so, I calculated corr([5, 5, 6], [8, 2, 6]) = 0.19 where [5,5,6] is from df1['A'] and [8,2,6] is from df2['A']

Since, I have 200 columns each I am finding it extremely cumbersome to run a for loop two times. First loop through columns and second using trailing 3 day lag data.

Upvotes: 2

Answers (2)

piRSquared

Reputation: 294506

Option 1
I built a generator and wrapped it in pd.concat

def rolling_corrwith(d1, d2, window):
    d1, d2 = d1.align(d2, 'inner')
    for i in range(len(d1) - window + 1):
        j = i + window
        yield d1.iloc[i:j].corrwith(d2.iloc[i:j]).rename(d1.index[j-1])

pd.concat(list(rolling_corrwith(df1, df2, 3)), axis=1).T

                  A         B
1/6/2017   0.188982 -0.944911
1/9/2017   0.785714 -0.960769
1/10/2017  0.896258 -0.968620
1/11/2017  1.000000 -0.998906
1/12/2017  1.000000  0.423415
1/13/2017  0.240192  0.838628

Option 2
Using numpy strides. I don't recommend this approach. But it's worth mentioning for those who are interested.

from numpy.lib.stride_tricks import as_strided as strided

def sprp(v, w):
    s0, s1 = v.strides
    n, m = v.shape
    return strided(v, (n + 1 - w, w, m), (s0, s0, s1))

def rolling_corrwith2(d1, d2, window):
    d1, d2 = d1.align(d2, 'inner')

    s1 = sprp(d1.values, window)
    s2 = sprp(d2.values, window)

    m1 = s1.mean(1, keepdims=1)
    m2 = s2.mean(1, keepdims=1)
    z1 = s1.std(1)
    z2 = s2.std(1)

    c  = ((s1 - m1) * (s2 - m2)).sum(1) / z1 / z2 / window

    return pd.DataFrame(c, d1.index[window - 1:], d1.columns)

rolling_corrwith2(df1, df2, 3)

                  A         B
1/6/2017   0.188982 -0.944911
1/9/2017   0.785714 -0.960769
1/10/2017  0.896258 -0.968620
1/11/2017  1.000000 -0.998906
1/12/2017  1.000000  0.423415
1/13/2017  0.240192  0.838628

Upvotes: 3

BENY

Reputation: 323366

Is this what you need ?

l=[]
id=df1.columns.intersection(df2.columns)
for x in id:
    l.append(pd.rolling_corr(df1[x],df2[x],window=3))# notice you should change it to `l.append(df1[x].rolling(3).corr(df2[x]))`

pd.concat(l,axis=1)


Out[13]: 
                  A         B
1/4/2017        NaN       NaN
1/5/2017        NaN       NaN
1/6/2017   0.188982 -0.944911
1/9/2017   0.785714 -0.960769
1/10/2017  0.896258 -0.968620
1/11/2017  1.000000 -0.998906
1/12/2017  1.000000  0.423415
1/13/2017  0.240192  0.838628

Upvotes: 4

Finding correlation for corresponding columns in dataframe

Answers (2)

Related Questions