Reputation: 197
I'm wondering if there is a fast way to do a running correlation in Python with one fixed series. I've tried Pandas, for example: df1.rolling(4).corr(df2). However, it requires the two DataFrames to have the same length. Is there a way to do something similar to the Pandas example above, but with one DataFrame being fixed?
To clarify, I want to calculate the correlation coefficient between df2 below and the values in df1.
Example:
First correlation between df2 and df1.loc[0:3]
Second correlation between df2 and df1.loc[1:4]
etc.
I've managed to do this with a loop, but I find it inefficient when working with larger DataFrames.
import pandas as pd

df1 = pd.DataFrame([1,3,2,4,5,6,3,4])
df2 = pd.DataFrame([1,2,3,2])
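For reference, a loop like the one described might look like this (a sketch, assuming scipy.stats.pearsonr for the correlation):

from scipy.stats import pearsonr

# Correlate every length-4 window of df1 with the fixed df2 values.
window = len(df2)
correlations = [
    pearsonr(df1[0].iloc[i:i + window], df2[0])[0]
    for i in range(len(df1) - window + 1)
]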
Upvotes: 2
Views: 646
Reputation: 33770
You can use pandas.DataFrame.rolling, which returns a pandas.core.window.Rolling object, which in turn has an apply method. You can then pass to apply() any function that calculates the correlation you want.
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame([1,3,2,4,5,6,3,4,1,2,3,2,2,3,2,5,1,2,1,2,8,8,8,8,8,8,8])
df2 = pd.DataFrame([1,2,3,2])

# The fixed series that every rolling window of df1 is correlated against.
CORR_VALS = df2[0].values

def get_correlation(vals):
    # pearsonr returns (correlation, p-value); keep only the correlation
    return pearsonr(vals, CORR_VALS)[0]

df1['correlation'] = df1.rolling(window=len(CORR_VALS)).apply(get_correlation)
The window argument in df1.rolling() should have the same length as the array you are calculating the correlation against. This outputs:
In [5]: df1['correlation'].values
Out[5]:
array([ nan, nan, nan, 0.31622777, 0.31622777,
0.71713717, 0.63245553, -0.63245553, -0.39223227, -0.63245553,
-0.63245553, 1. , 0. , -0.70710678, 0.81649658,
0. , 0.47809144, -0.23570226, -0.64699664, 0. ,
0. , 0.7570333 , 0.76509206, 0.11043153, -0.77302068,
-0.11043153, 0.86164044])
which, plotted, would look like this:
[plot of df1['correlation']]
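A plot like that can be reproduced with a few lines (a sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# Plot the rolling correlation; the first len(df2)-1 values are NaN
# and show up as a gap at the start of the line.
df1['correlation'].plot()
plt.ylabel('correlation with df2')
plt.show()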
Upvotes: 2