Reputation: 47
I have the following table indexed with a timestamp.
The data is during business hours 8 AM - 11 PM but extends over multiple days
When I perform a rolling statistic like mean, and specify the period to be 24 hours, the window overlaps from January 5th back into January 4th.
Specifically, I am wondering how I can efficiently perform the rolling period to ONLY include the current day that is in the window.
My current inefficient method is to create a custom function that selects valid timestamp indices for each rolling calculation, but this is extremely slow.
def mean(x):
x = x[(x.index.hour >= 8) & (x.index.hour <= 23)]
return 100.0 * (sum(x) / (len(x)))
Index | Value | Normal | Desired |
---|---|---|---|
2021-01-04 08:35:15 | 0 | 0 | 0 |
2021-01-04 10:35:45 | 0 | 0 | 0 |
2021-01-04 16:35:30 | 1 | 0.333 | 0.333 |
2021-01-04 21:35:00 | 1 | 0.5 | 0.5 |
2021-01-05 08:15:00 | 1 | 0.6 | 1.0 |
2021-01-05 08:35:15 | 0 | 0.5 | 0.5 |
2021-01-05 12:35:42 | 0 | 0.428 | 0.333 |
2021-01-05 14:35:24 | 1 | 0.5 | 0.5 |
2021-01-04 20:35:23 | 0 | 0.444 | 0.4 |
Upvotes: 0
Views: 380
Reputation: 425
You can get the desired result by chaining a groupby operation with a windowing operation. According to the documentation this will "first group the data by the specified keys and then perform a windowing operation per group".
In [711]: df.groupby(lambda x: x.date()).rolling('1D').mean().reset_index(0, drop=True)
Out[711]:
Value
2021-01-04 08:35:15 0.000000
2021-01-04 10:35:45 0.000000
2021-01-04 16:35:30 0.333333
2021-01-04 21:35:00 0.500000
2021-01-05 08:15:00 1.000000
2021-01-05 08:35:15 0.500000
2021-01-05 12:35:42 0.333333
2021-01-05 14:35:24 0.500000
2021-01-05 20:35:23 0.400000
Upvotes: 1