Reputation: 19395
I have a dataframe as follows
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.random.randn(50000)}, index=pd.date_range('1/1/2000', periods=50000, freq='T'))
df.head(10)
Out[37]:
X
2000-01-01 00:00:00 -0.699565
2000-01-01 00:01:00 -0.646129
2000-01-01 00:02:00 1.339314
2000-01-01 00:03:00 0.559563
2000-01-01 00:04:00 1.529063
2000-01-01 00:05:00 0.131740
2000-01-01 00:06:00 1.282263
2000-01-01 00:07:00 -1.003991
2000-01-01 00:08:00 -1.594918
2000-01-01 00:09:00 -0.775230
I would like to create a variable that contains the sum of X at the same time of day over the preceding days.
In other words: at 2000-01-01 00:00:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:00:00 during the last 5 days in the data (not including 2000-01-01 itself, of course). At 2000-01-01 00:01:00, df['rolling_sum_same_hour'] contains the sum of X observed at 00:01:00 during the last 5 days, and so on. The intuitive idea is that intraday prices have intraday seasonality, and I want to get rid of it that way.
I tried df['rolling_sum_same_hour'] = df.at_time(df.index.minute).rolling(window=5).sum(), with no success. Any ideas?
Many thanks!
Upvotes: 4
Views: 2732
Reputation: 637
Behold the power of groupby!
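# df as defined in the question
# group_keys=False keeps the original DatetimeIndex, so the result aligns with df on assignment
df['rolling_sum_by_time'] = df.groupby(df.index.time, group_keys=False)['X'].apply(lambda x: x.shift(1).rolling(10).sum())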
It's a lot to take in at once, but here is what happens: we group by time of day (as in Python's datetime.time), then select the column we care about (otherwise apply would operate on the whole frame; now it operates on each time-of-day group), and then apply the function we want. The shift(1) excludes the current day's value from the window.
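To see what the grouped shift-and-roll computes, here is a minimal sketch on a small deterministic frame rather than random data. The helper name prior_sum, the 8-day frame, and the 5-day window are assumptions chosen for illustration; X is set to the day number so the sums are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Small deterministic frame: 8 days of minute data, X = day number (0..7)
idx = pd.date_range("2000-01-01", periods=8 * 1440, freq="min")
df = pd.DataFrame({"X": np.repeat(np.arange(8.0), 1440)}, index=idx)

def prior_sum(s, window=5):
    # Hypothetical helper: shift(1) drops the current day's value,
    # then rolling sums the `window` values before it within the group
    return s.shift(1).rolling(window).sum()

# group_keys=False keeps the original DatetimeIndex so the result aligns with df
df["rolling_sum_by_time"] = (
    df.groupby(df.index.time, group_keys=False)["X"].apply(prior_sum)
)

# Manual check for one timestamp: sum X at the same minute on the 5 previous days
ts = pd.Timestamp("2000-01-07 00:01:00")
manual = sum(df.at[ts - pd.Timedelta(days=k), "X"] for k in range(1, 6))
print(df.at[ts, "rolling_sum_by_time"], manual)
```

The first five days come out as NaN because there are not yet five earlier observations at the same time of day.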
Upvotes: 3
Reputation: 76366
IIUC, what you want is to perform a rolling sum, but only on the observations grouped by the exact same time of day. This can be done by
df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum())
(Note that your question alternates between 5 and 10 periods.) For example:
In [43]: df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum()).tail()
Out[43]:
2000-02-04 17:15:00 -2.135887
2000-02-04 17:16:00 -3.056707
2000-02-04 17:17:00 0.813798
2000-02-04 17:18:00 -1.092548
2000-02-04 17:19:00 -0.997104
Freq: T, Name: X, dtype: float64
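Note that rolling(window=5) on its own includes the current day's observation in each window, whereas the question asks to exclude it. The following is a hedged sketch of the difference, using transform so the result keeps the original index. The column names incl_today and excl_today and the deterministic 7-day frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# 7 days of minute data, X = day number (0..6), so sums are easy to verify
idx = pd.date_range("2000-01-01", periods=7 * 1440, freq="min")
df = pd.DataFrame({"X": np.repeat(np.arange(7.0), 1440)}, index=idx)

# One group per (hour, minute) pair, i.e. per time of day
grouped = df["X"].groupby([df.index.hour, df.index.minute])

# Window of 5 that includes the current day's value
df["incl_today"] = grouped.transform(lambda g: g.rolling(window=5).sum())
# Window of 5 covering only the previous days (current day shifted out)
df["excl_today"] = grouped.transform(lambda g: g.shift(1).rolling(window=5).sum())

print(df.loc["2000-01-06 00:00":"2000-01-06 00:02"])
```

Which variant is appropriate depends on whether the current observation should count toward its own seasonal baseline.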
Upvotes: 2