Reputation: 25
I have a huge timeseries dataframe with over 1m rows. It has list of monthly stock returns and I'd like to create a new row to track the rolling sum of 3 previous months. The dataframe has all company A rows first, then all company B rows, then all company C rows...
For example:
date COMNAM PRC RET
395 2017-02-28 GAS NATURAL INC 12.650 0.000
396 2017-03-31 GAS NATURAL INC 12.700 0.010
397 2017-04-28 GAS NATURAL INC 12.500 -0.016
398 2017-05-31 GAS NATURAL INC 12.700 0.016
399 2017-06-30 GAS NATURAL INC 12.925 0.024
400 2017-07-31 GAS NATURAL INC 12.950 0.002
401 2017-08-31 GAS NATURAL INC nan nan
402 1985-12-31 NaN nan nan
403 1986-01-31 MOBILE NATIONAL CORP 11.625 nan
404 1986-02-28 MOBILE NATIONAL CORP 13.250 0.140
405 1986-03-31 MOBILE NATIONAL CORP 14.188 0.071
406 1986-04-30 MOBILE NATIONAL CORP 14.938 0.053
407 1986-05-30 MOBILE NATIONAL CORP 14.625 -0.021
408 1986-06-30 MOBILE NATIONAL CORP 12.688 -0.132
409 1986-07-31 MOBILE NATIONAL CORP 13.312 0.049
410 1986-08-29 MOBILE NATIONAL CORP 13.312 0.000
411 1986-09-30 MOBILE NATIONAL CORP 14.250 0.070
412 1986-10-31 MOBILE NATIONAL CORP 13.375 -0.061
413 1986-11-28 MOBILE NATIONAL CORP 13.375 0.000
414 1986-12-31 MOBILE NATIONAL CORP 12.375 -0.075
The rolling()-function would give me the sum of 3 previous months, but that would include the last returns of the previous stock for the first dates of each company. I have a feeling the groupby() function might be able to help, but I'm kinda stuck on how. Or am I overthinking it, and there is some better way where I don't even need the groupby?
Upvotes: 1
Views: 42
Reputation: 30991
To compute the rolling sum of 3 previous months (without the current month), for two columns of interest, from the current group of rows, define the following function:
def mySum(grp):
return grp[['PRC', 'RET']].shift().rolling(3).sum()
Then, to get such rolling sums for each group (company), run:
result = df.join(df[df.COMNAM.notnull()].groupby('COMNAM').apply(mySum)\
.reset_index(level=0, drop=True).add_prefix('r'))
The result is a join between the current df and the result of invocation of the above function for each group (company). Column names of the intermediate result are prepended with r, to mark the rolling sum.
For your data sample the result is:
date COMNAM PRC RET rPRC rRET
0 2017-02-28 GAS NATURAL INC 12.650 0.000 NaN NaN
1 2017-03-31 GAS NATURAL INC 12.700 0.010 NaN NaN
2 2017-04-28 GAS NATURAL INC 12.500 -0.016 NaN NaN
3 2017-05-31 GAS NATURAL INC 12.700 0.016 37.850 -0.006
4 2017-06-30 GAS NATURAL INC 12.925 0.024 37.900 0.010
5 2017-07-31 GAS NATURAL INC 12.950 0.002 38.125 0.024
6 2017-08-31 GAS NATURAL INC NaN NaN 38.575 0.042
7 1985-12-31 NaN NaN NaN NaN NaN
8 1986-01-31 MOBILE NATIONAL CORP 11.625 NaN NaN NaN
9 1986-02-28 MOBILE NATIONAL CORP 13.250 0.140 NaN NaN
10 1986-03-31 MOBILE NATIONAL CORP 14.188 0.071 NaN NaN
11 1986-04-30 MOBILE NATIONAL CORP 14.938 0.053 39.063 NaN
12 1986-05-30 MOBILE NATIONAL CORP 14.625 -0.021 42.376 0.264
13 1986-06-30 MOBILE NATIONAL CORP 12.688 -0.132 43.751 0.103
14 1986-07-31 MOBILE NATIONAL CORP 13.312 0.049 42.251 -0.100
15 1986-08-29 MOBILE NATIONAL CORP 13.312 0.000 40.625 -0.104
16 1986-09-30 MOBILE NATIONAL CORP 14.250 0.070 39.312 -0.083
17 1986-10-31 MOBILE NATIONAL CORP 13.375 -0.061 40.874 0.119
18 1986-11-28 MOBILE NATIONAL CORP 13.375 0.000 40.937 0.009
19 1986-12-31 MOBILE NATIONAL CORP 12.375 -0.075 41.000 0.009
If you want to "ignore" NaN values (treat them as 0), change the function to:
def mySum(grp):
return grp[['PRC', 'RET']].fillna(0).shift().rolling(3).sum()
Upvotes: 1