sanguineturtle

Reputation: 1455

Pandas: Timing difference between Function and Apply to Series

I am trying to figure out why these two methods differ in %timeit results.

import pandas as pd
import numpy as np
d = pd.DataFrame(data={'S1' : [2,3,4,5,6,7,2], 'S2' : [4,5,2,3,4,6,8]}, \
                 index=[1,2,3,4,5,6,7])

%timeit pd.rolling_mean(d, window=3, center=True)
10000 loops, best of 3: 182 µs per loop

%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True))
1000 loops, best of 3: 695 µs per loop

Why is the apply(lambda) method ~3.5x slower? On more complex DataFrames, I have noticed an even larger difference (~10x).

Does the lambda method create a copy of the data in this operation?
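(Note for readers on newer pandas: `pd.rolling_mean` was deprecated and later removed; the equivalent operation in the current API goes through the `DataFrame.rolling` method. A minimal sketch reproducing the question's computation:)

```python
import pandas as pd

d = pd.DataFrame({'S1': [2, 3, 4, 5, 6, 7, 2],
                  'S2': [4, 5, 2, 3, 4, 6, 8]},
                 index=[1, 2, 3, 4, 5, 6, 7])

# Modern equivalent of pd.rolling_mean(d, window=3, center=True):
result = d.rolling(window=3, center=True).mean()

# With center=True, each value is the mean of itself and its neighbors,
# so the edge rows (no full window) come out as NaN.
print(result)
```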

Upvotes: 1

Views: 1413

Answers (1)

Karl D.

Reputation: 13757

Looks like most of the performance difference in this example can be eliminated with the raw=True option:

%timeit pd.rolling_mean(d, window=3, center=True)
1000 loops, best of 3: 281 µs per loop

%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True))
1000 loops, best of 3: 1.02 ms per loop

Now add the raw=True option:

%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True),raw=True)
1000 loops, best of 3: 289 µs per loop

Adding reduce=False gets you a minor additional speed-up, since pandas doesn't have to guess the return type:

%timeit d.apply(lambda x: pd.rolling_mean(x, window=3,center=True),raw=True,reduce=False)
1000 loops, best of 3: 285 µs per loop

So in this case it looks like most of the performance difference is related to apply converting each column to a Series and passing each Series separately to rolling_mean. With raw=True, apply passes the underlying ndarrays instead, skipping that conversion.
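A small sketch illustrating the point above: with the default apply, each column arrives at the function wrapped in a Series; with raw=True it arrives as a bare NumPy ndarray. (Written against the current pandas API; the `seen_*` sets are just instrumentation for this demo.)

```python
import pandas as pd

d = pd.DataFrame({'S1': [2, 3, 4, 5, 6, 7, 2],
                  'S2': [4, 5, 2, 3, 4, 6, 8]})

# Record which type apply hands to the function in each mode.
seen_default, seen_raw = set(), set()

# set.add returns None, so `... or x.mean()` still returns the mean.
d.apply(lambda x: seen_default.add(type(x).__name__) or x.mean())
d.apply(lambda x: seen_raw.add(type(x).__name__) or x.mean(), raw=True)

print(seen_default)  # each column wrapped in a Series
print(seen_raw)      # the underlying ndarray, no per-column wrapping
```

Constructing a Series per column (index alignment, dtype bookkeeping) is exactly the overhead that raw=True avoids.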

Upvotes: 4
