Reputation: 165
I want to compute rolling sums group-wise for a large number of groups and I'm having trouble doing it acceptably quickly.
Pandas has built-in methods for rolling and expanding calculations.
Here's an example:
import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
df = pd.DataFrame(
    data=np.random.normal(size=obs * k).reshape(obs, k),
    index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)
To get rolling and expanding sums I can use
df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()
But this takes a long time when the number of groups is very large. For expanding sums, using the pandas method cumsum instead is almost 60 times quicker (16 s vs 280 ms for the above example) and turns hours into minutes.
df.groupby(level=0).cumsum()
Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?
Upvotes: 4
Views: 2534
Reputation: 193
To provide the latest information on this: if you upgrade pandas, the performance of groupby rolling has been significantly improved. It is roughly 4-5x faster in 1.1.0 and about 12x faster in 1.2.0 and later, compared to 0.24 or 1.0.0.
I believe the biggest performance improvement comes from this PR, which means more of the work can be done in Cython (previously it was implemented roughly like groupby.apply(lambda x: x.rolling())).
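For illustration, the slow pattern that older versions effectively ran looks roughly like the sketch below (an approximation based on the description above, not the actual internals), reusing the df from the question:
# Rough equivalent of the pre-cythonized implementation (sketch only):
# each group's frame goes through a Python-level rolling call, which is slow.
slow_equivalent = df.groupby(level=0).apply(lambda x: x.rolling(10, min_periods=1).sum())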
I used the below code to benchmark:
import pandas
import numpy

print(pandas.__version__)
print(numpy.__version__)

def stack_overflow_df():
    obs_per_g = 20
    g = 10000
    obs = g * obs_per_g
    k = 2
    df = pandas.DataFrame(
        data=numpy.random.normal(size=obs * k).reshape(obs, k),
        index=pandas.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
    )
    return df
df = stack_overflow_df()
%%timeit
# N.B. droplevel important to make indices match
rolling_result = (
    df.groupby(level=0)[[0, 1]].rolling(10, min_periods=1).sum().droplevel(level=0)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result

# results:
# numpy version always 1.19.4
# pandas 0.24 = 12.3 seconds
# pandas 1.0.5 = 12.9 seconds
# pandas 1.1.0 = broken with groupby rolling bug
# pandas 1.1.1 = 2.9 seconds
# pandas 1.1.5 = 2.5 seconds
# pandas 1.2.0 = 1.06 seconds
# pandas 1.2.2 = 1.06 seconds
I think care must be taken when trying to use a cumsum-based approach to improve performance (regardless of pandas version). For example, using something like the below:
# Gives different output
df.groupby(level=0)[[0, 1]].cumsum() - df.groupby(level=0)[[0, 1]].cumsum().shift(10)
While this is much faster, the output is not correct. The shift is performed over all rows and mixes the cumsums of different groups, i.e. cumsum values from the end of one group leak into the first rows of the next group.
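As a quick illustrative check (a sketch reusing the benchmark df from above), you can compare the naive shifted-cumsum result with the groupby rolling result; the differences appear in the first rows of every group after the first:
# Naive version: the shift leaks cumsum values across group boundaries
naive = (
    df.groupby(level=0)[[0, 1]].cumsum()
    - df.groupby(level=0)[[0, 1]].cumsum().shift(10)
)
# Per-group rolling version from earlier
correct = (
    df.groupby(level=0)[[0, 1]]
    .rolling(10, min_periods=1)
    .sum()
    .droplevel(level=0)
)
# Large non-zero values show where the naive approach mixes groups
print((naive - correct).abs().max())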
To have the same behaviour as above, you need to use apply:
df.groupby(level=0)[[0, 1]].cumsum() - df.groupby(level=0)[[0, 1]].apply(
    lambda x: x.cumsum().shift(10).fillna(0)
)
which, in the most recent version (1.2.2), is slower than using rolling directly. Hence, for groupby rolling sums, I don't think a cumsum-based approach is the best solution for pandas >= 1.1.1.
For completeness, if your groups are columns rather than the index, you should use syntax like this:
# N.B. reset_index important to make indices match
rolling_result = (
    df.groupby(["category_0", "category_1"])[["value_0", "value_1"]]
    .rolling(10, min_periods=1)
    .sum()
    .reset_index(drop=True)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result
Upvotes: 1
Reputation: 984
I had the same experience with .rolling(): it's nice, but only with small datasets or if the function you are applying is non-standard. With sum(), I would suggest using cumsum() and subtracting cumsum().shift(5):
df.groupby(level=0).cumsum() - df.groupby(level=0).cumsum().shift(5)
Upvotes: 3