Reputation: 668
Lets say I have a pandas.Dataframe that looks as follows:
c1 | c2
-------
1 | 5
2 | 6
3 | 7
4 | 8
.....
1 | 7
and I'm looking to map a function (DataFrame.corr
) but I would like it to take n
rows at a time. The result should be a series with the correlation values that would be shorter than the original DataFrame or with a few values that didn't get a full n
rows of data.
Is there a way to do this and how? I've been looking through the DataFrame and Map, Apply, Filter documentation but it doesn't seem to have an obvious or clean solution.
Upvotes: 2
Views: 1411
Reputation: 294586
With pandas 0.20, using rolling
with corr
produces a multi indexed dataframe. You can slice afterwards to get at what you're looking for.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['c1', 'c2'])
c1 c2
0 0 2
1 7 3
2 8 7
3 0 6
4 8 6
5 0 2
6 0 4
7 9 7
8 3 2
9 4 3
rolling
+ corr
... pandas 0.20.x
df.rolling(5).corr().dropna().c1.xs('c2', level=1)
# Or equivalently
# df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c1, dtype: float64
rolling
+ corr
... pandas 0.19.x
or prior
Prior to 0.20, rolling
+ corr
produced a pd.Panel
df.rolling(5).corr().loc[:, 'c1', 'c2'].dropna()
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c2, dtype: float64
numpy
+ as_strided
However, I wasn't satisfied with the above answers. Below is a specialized function that takes an nx2
dataframe and returns a series of the rolling correlations. DISCLAIMER This uses some advanced techniques and should really only be used if you know what this does. Meaning if you need a detailed breakdown of how this works... then it probably isn't for you.
from numpy.lib.stride_tricks import as_strided as strided
def rolling_correlation(a, w):
n, m = a.shape[0], 2
s1, s2 = a.strides
b = strided(a, (m, w, n - w + 1), (s2, s1, s1))
b_mb = b - b.mean(1, keepdims=True)
b_ss = (b_mb ** 2).sum(1) ** .5
return (b_mb[0] * b_mb[1]).sum(0) / (b_ss[0] * b_ss[1])
def rolling_correlation_df(df, w):
a = df.values
return pd.Series(rolling_correlation(a, w), df.index[w-1:])
rolling_correlation_df(df, 5)
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
dtype: float64
Timing
small data
%timeit rolling_correlation_df(df, 5)
10000 loops, best of 3: 79.9 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
100 loops, best of 3: 14.6 ms per loop
large data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10000, 2)), columns=['c1', 'c2'])
%timeit rolling_correlation_df(df, 5)
1000 loops, best of 3: 615 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
1 loop, best of 3: 1.98 s per loop
Upvotes: 3