Use Map function with subsets of a pandas series or dataframe

Question

Lets say I have a pandas.Dataframe that looks as follows:

c1 | c2
-------
 1 | 5 
 2 | 6 
 3 | 7 
 4 | 8
 .....
 1 | 7

and I'm looking to map a function (DataFrame.corr) but I would like it to take n rows at a time. The result should be a series with the correlation values that would be shorter than the original DataFrame or with a few values that didn't get a full n rows of data.

Is there a way to do this and how? I've been looking through the DataFrame and Map, Apply, Filter documentation but it doesn't seem to have an obvious or clean solution.

piRSquared · Accepted Answer

With pandas 0.20, using rolling with corr produces a multi indexed dataframe. You can slice afterwards to get at what you're looking for.

Consider the dataframe df

np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['c1', 'c2'])

   c1  c2
0   0   2
1   7   3
2   8   7
3   0   6
4   8   6
5   0   2
6   0   4
7   9   7
8   3   2
9   4   3

rolling + corr... pandas 0.20.x

df.rolling(5).corr().dropna().c1.xs('c2', level=1)
# Or equivalently
# df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])

4    0.399056
5    0.399056
6    0.684653
7    0.696074
8    0.841136
9    0.762187
Name: c1, dtype: float64

rolling + corr... pandas 0.19.x or prior
Prior to 0.20, rolling + corr produced a pd.Panel

df.rolling(5).corr().loc[:, 'c1', 'c2'].dropna()

4    0.399056
5    0.399056
6    0.684653
7    0.696074
8    0.841136
9    0.762187
Name: c2, dtype: float64

numpy + as_strided
However, I wasn't satisfied with the above answers. Below is a specialized function that takes an nx2 dataframe and returns a series of the rolling correlations. DISCLAIMER This uses some advanced techniques and should really only be used if you know what this does. Meaning if you need a detailed breakdown of how this works... then it probably isn't for you.

from numpy.lib.stride_tricks import as_strided as strided

def rolling_correlation(a, w):
    n, m = a.shape[0], 2
    s1, s2 = a.strides
    b = strided(a, (m, w, n - w + 1), (s2, s1, s1))
    b_mb = b - b.mean(1, keepdims=True)
    b_ss = (b_mb ** 2).sum(1) ** .5
    return (b_mb[0] * b_mb[1]).sum(0) / (b_ss[0] * b_ss[1])

def rolling_correlation_df(df, w):
    a = df.values
    return pd.Series(rolling_correlation(a, w), df.index[w-1:])

rolling_correlation_df(df, 5)

4    0.399056
5    0.399056
6    0.684653
7    0.696074
8    0.841136
9    0.762187
dtype: float64

Timing
small data

%timeit rolling_correlation_df(df, 5)
10000 loops, best of 3: 79.9 µs per loop

%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
100 loops, best of 3: 14.6 ms per loop

large data

np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10000, 2)), columns=['c1', 'c2'])

%timeit rolling_correlation_df(df, 5)
1000 loops, best of 3: 615 µs per loop

%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
1 loop, best of 3: 1.98 s per loop

Use Map function with subsets of a pandas series or dataframe

Answers (1)

Related Questions