Reputation: 2324
I want to aggregate a pandas.Series
with an hourly DatetimeIndex
to monthly values - while considering the offset to midnight.
Consider the following (uniform) timeseries that spans about 1.5 months.
import pandas as pd
hours = pd.Series(1, pd.date_range('2020-02-23 06:00', freq = 'H', periods=1008))
hours
# 2020-02-23 06:00:00 1
# 2020-02-23 07:00:00 1
# ..
# 2020-04-05 04:00:00 1
# 2020-04-05 05:00:00 1
# Freq: H, Length: 1000, dtype: int64
I would like to sum these to months while considering, that days start at 06:00 in this use-case. The result should be:
2020-02-01 06:00:00 168
2020-03-01 06:00:00 744
2020-04-01 06:00:00 96
freq: MS, dtype: int64
How do I do that??
I can aggregate to days while considering the offset, using the offset
parameter:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
days
# 2020-02-23 06:00:00 24
# 2020-02-24 06:00:00 24
# ..
# 2020-04-03 06:00:00 24
# 2020-04-04 06:00:00 24
# Freq: D, dtype: int64
Using the same method to aggregate to months does not work. The timestamps do not have a time component, and the values are incorrect:
months = hours.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 162 # wrong
# 2020-03-01 744
# 2020-04-01 102 # wrong
# Freq: MS, dtype: int64
I could do the aggregation to months as a second step after aggregating to days. In that case, the values are correct, but the time component is still missing from the timestamps:
days = hours.resample('D', offset=pd.Timedelta('06:00:00')).sum()
months = days.resample('MS', offset=pd.Timedelta('06:00:00')).sum()
months
# 2020-02-01 168
# 2020-03-01 744
# 2020-04-01 96
# Freq: MS, dtype: int64
My current workaround is adding the timedelta and resetting the frequency manually.
months.index += pd.Timedelta('06:00:00')
months.index.freq = 'MS'
months
# 2020-02-01 06:00:00 168
# 2020-03-01 06:00:00 744
# 2020-04-01 06:00:00 96
# freq: MS, dtype: int64
Upvotes: 5
Views: 1166
Reputation: 28243
Not too much of an improvement on your attempt, but you could write the resampling as
months = hours.resample('D', offset='06:00:00').sum().resample('MS').sum()
changing the index labels still requires the hack you've been doing, as in adding the time delta manually and setting freq
to MS
note that you can pass a string representation of the time delta to offset.
The reason two resampling operations are needed is because when the resampling frequency is greater than 'D', the offset is ignored. Once your resample at the daily level is performed with the offset, the result can be further resampled without specifying the offset.
I believe this is buggy behaviour, and I agree with you that hours.resample('MS', offset='06:00:00').sum()
should produce the expected result.
Essentially, there are two issues:
That there is a related bug issue impacting resampling with offsets. I have not determined yet whether that and the issue you face have the same root cause. Its the same root cause.
Upvotes: 1