Reputation: 131
I'm trying to get the forward n-minute return of stocks per day, given a dataframe whose rows are returns over some interval.
I've tried using dask and multithreading the rolling calculation for each group, but the approach below still seems to be the fastest I can come up with. Yet for a large dataframe (millions of rows: 252 days and 1000 stocks) this step takes up to 40 minutes.
# Sort descending so that, within each (date, stock) group, a trailing
# rolling window looks forward in time.
ret_df.sort_values(['date', 'time', 'stock'], ascending=False, inplace=True)
gb = ret_df.groupby(['date', 'stock'])
# Sum of the current interval plus the next 3 (4 rows total) per group.
forward_sum_df = gb.rolling(4, on='time', min_periods=0)['interval_return'].sum().reset_index()
This returns the sum of the next 4 times (by date and stock) for each row in the dataframe, as expected, but it does so quite slowly. Thanks for the help!
EDIT: added an example to clarify
date stock time interval_ret
0 2017-01-03 10000001 09:30:00.000000 0.001418
1 2017-01-03 10000001 09:40:00.000000 0.000000
2 2017-01-03 10000001 09:50:00.000000 0.000000
3 2017-01-03 10000001 10:00:00.000000 -0.000474
4 2017-01-03 10000001 10:10:00.000000 -0.001417
5 2017-01-03 10000001 10:20:00.000000 -0.000944
6 2017-01-03 10000001 10:30:00.000000 0.000000
7 2017-01-03 10000001 10:40:00.000000 0.000000
8 2017-01-03 10000001 10:50:00.000000 0.000000
9 2017-01-03 10000001 11:00:00.000000 -0.000472
and so on for stock 10000002... and date 2017-01-04....
For instance, if my holding period is 30 minutes instead of 10 minutes, I'd like to sum up 3 rows of 'interval_ret', grouped by date and stock. Ex:
date stock time interval_ret_30
0 2017-01-03 10000001 09:30:00.000000 0.001418
1 2017-01-03 10000001 09:40:00.000000 0.000000 - 0.000474
2 2017-01-03 10000001 09:50:00.000000 0.000000 - 0.000474 - 0.001417
3 2017-01-03 10000001 10:00:00.000000 -0.000474 - 0.001417 - 0.000944
4 2017-01-03 10000001 10:10:00.000000 -0.001417 - 0.000944
5 2017-01-03 10000001 10:20:00.000000 -0.000944
6 2017-01-03 10000001 10:30:00.000000 0.000000
7 2017-01-03 10000001 10:40:00.000000 -0.000472
8 2017-01-03 10000001 10:50:00.000000 -0.000472
9 2017-01-03 10000001 11:00:00.000000 -0.000472
Upvotes: 1
Views: 929
Reputation: 42139
I don't know if you can adapt this to pandas, but you can get rolling cumulative sums for 20 million values in under a second using numpy:
import numpy as np

N = 20000000
stocks = np.random.random(N) * 100
window = 4
# Cumulative sum, padded with `window` trailing zeros so the shifted
# subtraction below stays in bounds near the end of the array.
cumStocks = np.cumsum(np.append(stocks, np.zeros(window)))
# rollSum[i] is the sum of the `window` values that follow position i.
rollSum = cumStocks[window:] - cumStocks[:-window]
The trick is to compute the cumulative sum for the whole array, and then subtract the resulting array from itself with an offset corresponding to the size of your window.
The cumsum source array is padded with zeroes to keep the original size. The last few elements, which are closer to the end of the array than the window size, get a rolling sum of only the values that remain. If you don't need these "incomplete" sums, you can simply use cumStocks = np.cumsum(stocks), and the calculation can handle 100 million values in under a second.
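To make the indexing concrete, here is the same trick on a tiny made-up array (the values and the window of 3 are purely illustrative):

import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values
window = 3
padded = np.append(vals, np.zeros(window))  # [1, 2, 3, 4, 5, 0, 0, 0]
c = np.cumsum(padded)                       # [1, 3, 6, 10, 15, 15, 15, 15]
roll = c[window:] - c[:-window]
# roll[i] sums the `window` values after position i:
# roll == [2+3+4, 3+4+5, 4+5, 5, 0] == [9., 12., 9., 5., 0.]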
Someone seems to have found a solution to this using pandas here: https://stackoverflow.com/a/56886389/5237560
df.groupby(level=0).cumsum() - df.groupby(level=0).cumsum().shift(5)
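For what it's worth, here is a rough sketch of how that cumsum idea might be adapted to the frame in the question. The column names (date, stock, time, interval_ret), the 3-row window, and the output name interval_ret_fwd are assumptions taken from the example above, so treat this as an illustration rather than a drop-in answer:

import pandas as pd

w = 3  # assumed: rows per forward window, e.g. 30 minutes of 10-minute bars
ret_df = ret_df.sort_values(['date', 'stock', 'time'])
keys = [ret_df['date'], ret_df['stock']]
# Running total of returns within each (date, stock) group.
cs = ret_df.groupby(['date', 'stock'])['interval_ret'].cumsum()
# Cumulative sum w-1 rows ahead; rows near the end of a group fall back to
# the group total, so they get "incomplete" sums over whatever rows remain.
end = cs.groupby(keys).shift(-(w - 1))
end = end.fillna(ret_df.groupby(['date', 'stock'])['interval_ret'].transform('sum'))
# Forward sum of rows i..i+w-1: cumsum at i+w-1, minus cumsum at i, plus row i's own value.
# 'interval_ret_fwd' is just an illustrative name for the result column.
ret_df['interval_ret_fwd'] = end - cs + ret_df['interval_ret']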
Upvotes: 2