Reputation: 197
I'm wondering if there is a fast way to do a running correlation in Python with one fixed series. I've tried Pandas, for example: df1.rolling(4).corr(df2). However, it requires the two DataFrames to have the same length. Is there a way to do something similar to the Pandas example above, but with one DataFrame being fixed?
To clarify, I want to calculate the correlation coefficient between df2 below and the values in df1.
Example:
First correlation between df2 and df1.loc[0:3]
Second correlation between df2 and df1.loc[1:4]
etc.
I've managed to do this with a loop, but I find it inefficient when working with larger DataFrames.
import pandas as pd

df1 = pd.DataFrame([1,3,2,4,5,6,3,4])
df2 = pd.DataFrame([1,2,3,2])
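For reference, a loop like the one described might look like this (a sketch, assuming scipy.stats.pearsonr for the correlation):

from scipy.stats import pearsonr

# Correlate every length-4 window of df1 with the fixed df2 values.
window = len(df2)
correlations = [
    pearsonr(df1[0].iloc[i:i + window], df2[0])[0]
    for i in range(len(df1) - window + 1)
]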
Upvotes: 2
Views: 646
Reputation: 33770
You can use pandas.DataFrame.rolling, which returns a pandas.core.window.Rolling object, which in turn has an apply method. You can then pass to apply() any function that calculates the correlation you want.
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame([1,3,2,4,5,6,3,4,1,2,3,2,2,3,2,5,1,2,1,2,8,8,8,8,8,8,8])
df2 = pd.DataFrame([1,2,3,2])

# The fixed series that every rolling window of df1 is correlated against.
CORR_VALS = df2[0].values

def get_correlation(vals):
    # pearsonr returns (correlation, p-value); keep only the correlation
    return pearsonr(vals, CORR_VALS)[0]

df1['correlation'] = df1.rolling(window=len(CORR_VALS)).apply(get_correlation)
The window argument in df1.rolling() should have the same length as the array you are calculating the correlation against. This outputs:
In [5]: df1['correlation'].values
Out[5]:
array([ nan, nan, nan, 0.31622777, 0.31622777,
0.71713717, 0.63245553, -0.63245553, -0.39223227, -0.63245553,
-0.63245553, 1. , 0. , -0.70710678, 0.81649658,
0. , 0.47809144, -0.23570226, -0.64699664, 0. ,
0. , 0.7570333 , 0.76509206, 0.11043153, -0.77302068,
-0.11043153, 0.86164044])
which, plotted, would look like this:
[plot of df1['correlation']]
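A plot like that can be reproduced with a few lines (a sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# Plot the rolling correlation; the first len(df2)-1 values are NaN
# and show up as a gap at the start of the line.
df1['correlation'].plot()
plt.ylabel('correlation with df2')
plt.show()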
Upvotes: 2