bazel

Reputation: 299

Faster rolling_apply on a Pandas DataFrame?

Improving upon this question which provided a clever solution for applying a function over multiple columns in a DataFrame, I'm wondering if the solution can be further optimized for speed.

Environment: Python 2.7.8, Pandas 0.14.1, NumPy 1.8.

Here's the example setup:

import pandas as pd
import numpy as np

def meanmax(ii, df):
    # ii is a rolling window of (float) positional indices into df
    xdf = df.iloc[map(int, ii)]
    n = max(xdf['A']) + max(xdf['B'])
    return n / 2.0

df = pd.DataFrame(np.random.randn(2500, 2) / 10000,
                  index=pd.date_range('2001-01-01', periods=2500),
                  columns=['A', 'B'])
df['ii'] = range(len(df))  # positional index column to feed rolling_apply

res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))

Note that the meanmax function is not pairwise, thus something like rolling_mean(df['A'] + df['B'],26) won't work.
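A tiny check makes the point concrete (the values here are my own illustration, not from the post): the max of a sum is not the sum of the per-column maxes, so a pairwise reduction over `A + B` computes something different.

```python
import numpy as np

a = np.array([1.0, 5.0, 2.0])
b = np.array([4.0, 0.0, 3.0])

# meanmax takes the max of each column separately, then averages:
separate = (a.max() + b.max()) / 2.0   # (5 + 4) / 2 = 4.5

# reducing over the element-wise sum gives a different number:
combined = (a + b).max() / 2.0         # max(5, 5, 5) / 2 = 2.5
```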

However I can do something like:

res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2

Which completes roughly 3000x faster:

%timeit res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
1 loops, best of 3: 1 s per loop

%timeit res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
1000 loops, best of 3: 325 µs per loop

Is there anything better than, or equivalent to, the second option above, given the example function and using rolling_apply? While the second option is faster, it doesn't use rolling_apply, which can be applied to a much wider set of problems.

Edit: Performance timing correction

Upvotes: 3

Views: 5786

Answers (2)

DSM

Reputation: 353059

You won't be able to get down to rolling_max speed, but you can often shave off an order of magnitude or so by dropping down to numpy via .values:

def meanmax_np(ii, df):
    ii = ii.astype(int)
    n = df["A"].values[ii].max() + df["B"].values[ii].max()
    return n/2.0

gives me

>>> %timeit res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
1 loops, best of 3: 701 ms per loop
>>> %timeit res_np = pd.rolling_apply(df.ii, 26, lambda x: meanmax_np(x, df))
10 loops, best of 3: 31.2 ms per loop
>>> %timeit res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
1000 loops, best of 3: 247 µs per loop

That's still ~100x slower than the optimized case, but much faster than the original. Sometimes I only need something to be ten times faster for it to stop being the dominant timesink, and a change like this suffices.
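One further tweak worth trying (my own variation on the idea above, not from the answer): hoist the `.values` lookups out of the per-window call entirely, so each window pays only for the integer fancy-indexing and the two `.max()` calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(2500, 2) / 10000,
                  index=pd.date_range('2001-01-01', periods=2500),
                  columns=['A', 'B'])
df['ii'] = np.arange(len(df))

# Pull the underlying ndarrays out once, instead of on every window.
a_vals = df['A'].values
b_vals = df['B'].values

def meanmax_hoisted(ii):
    # ii arrives as a float window of positional indices
    idx = ii.astype(int)
    return (a_vals[idx].max() + b_vals[idx].max()) / 2.0

# drop-in replacement for the call in the answer:
# res = pd.rolling_apply(df.ii, 26, meanmax_hoisted)
```

The attribute lookups and Series-to-ndarray conversions are cheap individually, but rolling_apply calls the function once per window, so anything moved out of the closure is saved ~2500 times here.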

Upvotes: 3

Jaime

Reputation: 67427

Computing a generic rolling function over an array of size n with a window of size m requires roughly O(n*m) time. The built-in rolling_xxx methods use some pretty smart algorithms to keep the running time well below that, and in some cases can guarantee O(n) time, which if you think about it is a pretty impressive thing.

rolling_min and rolling_max in particular borrowed their implementation from bottleneck, which cites Richard Harter as the source of the algorithm, although I found what I think is an earlier description of the same algorithm in this paper.
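To make that concrete, here is a sketch of that style of O(n) rolling max (my own illustration, not the actual bottleneck/pandas implementation): keep a deque of indices whose values are decreasing, so the current window's max is always at the front.

```python
from collections import deque

def rolling_max_deque(values, window):
    # O(n) overall: each index is appended and popped at most once.
    out = [float('nan')] * len(values)
    dq = deque()  # indices of candidate maxima, values decreasing
    for i, v in enumerate(values):
        # retire indices that have slid out of the window
        while dq and dq[0] <= i - window:
            dq.popleft()
        # values <= v can never be a window max while v is in the window
        while dq and values[dq[-1]] <= v:
            dq.pop()
        dq.append(i)
        if i >= window - 1:
            out[i] = values[dq[0]]
    return out
```

For example, `rolling_max_deque([3, 1, 4, 1, 5, 9, 2, 6], 3)` gives `[nan, nan, 4, 4, 5, 9, 9, 9]`. A generic rolling_apply cannot exploit this kind of structure, because it knows nothing about the function it is calling.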

So after the history lesson: it is very likely that you will not be able to have your cake and eat it too. rolling_apply is very convenient, but it is almost always going to sacrifice performance relative to a purpose-built algorithm. In my experience, one of the more enjoyable parts of using the Python scientific stack is coming up with efficient ways of doing computations, using the fast primitives provided in creative ways. Your own solution calling rolling_max twice is a good example of this. So relax and enjoy the ride, knowing that you will always have rolling_apply to fall back on if you, or the good people of SO, cannot come up with a smarter solution.

Upvotes: 6
