Reputation: 1455
I am trying to figure out why these two methods differ in %timeit
results.
import pandas as pd
import numpy as np
d = pd.DataFrame(data={'S1': [2, 3, 4, 5, 6, 7, 2], 'S2': [4, 5, 2, 3, 4, 6, 8]},
                 index=[1, 2, 3, 4, 5, 6, 7])
%timeit pd.rolling_mean(d, window=3, center=True)
10000 loops, best of 3: 182 µs per loop
%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True))
1000 loops, best of 3: 695 µs per loop
Why is the apply(lambda) method ~3.5x slower? On more complex DataFrames, I have noticed an even larger difference (~10x).
Does the lambda method create a copy of the data in this operation?
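For what it's worth, both calls produce identical values, so only the performance differs (a quick sanity check added for illustration, using the old pandas.util.testing helper):
from pandas.util.testing import assert_frame_equal
# The whole-frame call and the per-column apply should agree exactly,
# including the NaN edges produced by center=True.
whole = pd.rolling_mean(d, window=3, center=True)
per_column = d.apply(lambda x: pd.rolling_mean(x, window=3, center=True))
assert_frame_equal(whole, per_column)  # passes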
Upvotes: 1
Views: 1413
Reputation: 13757
It looks like most of the performance difference in this example can be eliminated with the raw=True option. First, reproducing the baseline timings on my machine:
%timeit pd.rolling_mean(d, window=3, center=True)
1000 loops, best of 3: 281 µs per loop
%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True))
1000 loops, best of 3: 1.02 ms per loop
Now add the raw=True option:
%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True), raw=True)
1000 loops, best of 3: 289 µs per loop
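What raw=True changes: by default apply wraps each column in a Series before calling the function, whereas raw=True hands it the underlying ndarray directly. A small illustration (my own sketch, not part of the timings above):
d.apply(lambda x: type(x).__name__)            # each column arrives as a Series
d.apply(lambda x: type(x).__name__, raw=True)  # each column arrives as an ndarray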
Adding reduce=False gives a further minor speed-up, since pandas no longer has to infer whether the result should be reduced to a Series:
%timeit d.apply(lambda x: pd.rolling_mean(x, window=3, center=True), raw=True, reduce=False)
1000 loops, best of 3: 285 µs per loop
So in this case it looks like most of the performance difference comes from apply converting each column to a Series and passing each Series separately to rolling_mean. With raw=True, apply just passes plain ndarrays instead.
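As a side note, pd.rolling_mean was deprecated in pandas 0.18 in favor of the .rolling() method, which operates on the whole DataFrame at once, so no apply is needed:
# Modern equivalent of pd.rolling_mean(d, window=3, center=True)
d.rolling(window=3, center=True).mean()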
Upvotes: 4