blaylockbk

Reputation: 3351

How to use Pandas groupby method to bin hours with the 00:00 hour as the center of first bin

Using the Pandas groupby method to group data by hour of day is straightforward:

import pandas as pd
import numpy as np

# Create a sample dataset: one value for each hour over 48 hours
size = 48
df = pd.DataFrame(np.random.rand(size), 
                  index=pd.date_range('2021-01-01', periods=size, freq='h'))  # 'H' is deprecated in recent pandas

# Group the data by hour of day and find the mean
df.groupby(df.index.hour).mean()

Sometimes the hours need to be grouped into bins, which can be done with the pandas.cut function. The following bins the hours into 00:00-05:59, 06:00-11:59, 12:00-17:59, and 18:00-23:59:

# Group by bins
bins = [0, 6, 12, 18, 24]
df['time_bin'] = pd.cut(df.index.hour, bins, right=False)
df.groupby('time_bin').mean()
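
To make the effect of right=False concrete, here is a minimal sketch that lists the first and last hour landing in each bin. The names `table` and `first_last` are mine, chosen for illustration only.

```python
import pandas as pd

hours = list(range(24))
bins = [0, 6, 12, 18, 24]

# right=False makes every interval closed on the left and open on the right,
# so hour 6 belongs to [6, 12) and not to [0, 6)
binned = pd.cut(hours, bins, right=False)

# For each interval, report the smallest and largest hour it contains
table = pd.DataFrame({"hour": hours, "bin": binned})
first_last = table.groupby("bin", observed=False)["hour"].agg(["min", "max"])
print(first_last)
```

Each bin holds six consecutive hours, and the right edge 24 is never reached because df.index.hour only runs from 0 to 23.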

However, it is often desirable to bin the hours so that 00:00 falls in the center of the first bin (21:00-02:59, 03:00-08:59, 09:00-14:59, and 15:00-20:59). Passing those edges directly to pandas.cut does not work:

# Use 00:00 as center of first bin
bins = [21, 3, 9, 15, 21]
df['time_bin'] = pd.cut(df.index.hour, bins, right=False)

# ValueError: bins must increase monotonically.

How can you group by hour bins so that the 00:00 hour is at the center of the first bin?

Upvotes: 2

Views: 549

Answers (2)

ALollz

Reputation: 59579

Use the offset argument of resample/pd.Grouper. As a proof of concept, I'll create a simple DataFrame with one-second resolution and add a column holding the index values, so that after resampling we can see the minimum and maximum times in each bin.

import pandas as pd

# Create a sample dataset: one value per second over 26 hours
size = 26*60*60
df = pd.DataFrame(range(size), 
                  index=pd.date_range('2020-12-31 11:00:00', periods=size, freq='s'))
df['time'] = df.index

Resample into 2-hour buckets offset by one hour. The default origin for resample/pd.Grouper is:

origin='start_day': origin is the first day at midnight of the timeseries

so we can be sure that the offset shifts the bins from [0, 2), [2, 4), ... to [23, 1), [1, 3), ...

res = df.resample('2h', offset='1h')['time'].agg(['min', 'max'])  # '2H'/'1H' are deprecated in recent pandas

#                                    min                 max
#2020-12-31 11:00:00 2020-12-31 11:00:00 2020-12-31 12:59:59
#2020-12-31 13:00:00 2020-12-31 13:00:00 2020-12-31 14:59:59
#2020-12-31 15:00:00 2020-12-31 15:00:00 2020-12-31 16:59:59
#2020-12-31 17:00:00 2020-12-31 17:00:00 2020-12-31 18:59:59
#...
#2021-01-01 11:00:00 2021-01-01 11:00:00 2021-01-01 12:59:59
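
Since the offset argument is shared by resample and pd.Grouper, the same bins should come out of a groupby call. The sketch below rebuilds the sample data and compares the two; `via_resample` and `via_grouper` are illustrative names, and it assumes pandas 1.1 or newer, where offset was introduced.

```python
import pandas as pd

# Same shape as the sample data: one row per second over 26 hours
size = 26 * 60 * 60
df = pd.DataFrame(
    {"value": range(size)},
    index=pd.date_range("2020-12-31 11:00:00", periods=size, freq="s"),
)
df["time"] = df.index

# 2-hour buckets shifted by one hour, expressed two ways
via_resample = df.resample("2h", offset="1h")["time"].agg(["min", "max"])
via_grouper = df.groupby(pd.Grouper(freq="2h", offset="1h"))["time"].agg(["min", "max"])

print(via_resample.head())
print(via_grouper.head())
```

pd.Grouper is handy when the time bins need to be combined with other grouping keys in a single groupby call.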

The bins (i.e. the index) are labeled with the left edge. To label each bin by its center instead, shift the index afterwards by half the bin width, which here is 1 hour and happens to equal the offset used in the resample.

from pandas.tseries.frequencies import to_offset

res.index = res.index + to_offset('1h')

#                                    min                 max
#2020-12-31 12:00:00 2020-12-31 11:00:00 2020-12-31 12:59:59
#2020-12-31 14:00:00 2020-12-31 13:00:00 2020-12-31 14:59:59
#2020-12-31 16:00:00 2020-12-31 15:00:00 2020-12-31 16:59:59
#2020-12-31 18:00:00 2020-12-31 17:00:00 2020-12-31 18:59:59
#....
#2021-01-01 12:00:00 2021-01-01 11:00:00 2021-01-01 12:59:59
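
Applied to the 6-hour bins from the question, one possible sketch looks like the following. It is not part of the original answer: the column name `value` and the variables `per_window` and `by_bin` are assumptions of mine. A 3-hour offset on 6-hour buckets puts the edges at 21:00, 03:00, 09:00 and 15:00. Because resample produces one row per calendar window, the windows are then pooled by the hour of their left edge, using sums and counts so that partial windows at the ends are weighted correctly.

```python
import numpy as np
import pandas as pd

# Hourly sample data over 48 hours, as in the question
rng = np.random.default_rng(0)
size = 48
df = pd.DataFrame(
    {"value": rng.random(size)},
    index=pd.date_range("2021-01-01", periods=size, freq="h"),
)

# 6-hour windows with edges at 21:00, 03:00, 09:00, 15:00
per_window = df["value"].resample("6h", offset="3h").agg(["sum", "count"])

# Pool windows that start at the same hour of day, then form the mean
by_bin = per_window.groupby(per_window.index.hour).sum()
by_bin["mean"] = by_bin["sum"] / by_bin["count"]
print(by_bin)
```

The resulting index holds the left edge of each bin (3, 9, 15, 21), so the row labeled 21 is the bin centered on 00:00.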

Upvotes: 1

blaylockbk

Reputation: 3351

I had to offset some of the hours to make them monotonic for the bins I needed. The code below maps hour 23 to -1, hour 22 to -2, and hour 21 to -3.

# Create column of monotonic hours for the desired bins
# Copy the hours so the in-place subtraction cannot touch a shared view
hours = df.index.hour.to_numpy(copy=True)
hours[hours >= 21] -= 24
df['hours'] = hours

Now we can specify the first bin as -03:00-02:59, which puts 00:00 in the center of that bin.

bins = [-3, 3, 9, 15, 21]
df['time_bin'] = pd.cut(df['hours'], bins, right=False)
df.groupby('time_bin').mean()
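
As a possible alternative, the same wrap-around can be written with modular arithmetic, which avoids negative hours altogether. This is only a sketch: `half_width`, `width`, `bin_id` and the label strings are names I made up for illustration, and the data is a simple 0..47 ramp so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

size = 48
df = pd.DataFrame(
    {"value": np.arange(size, dtype=float)},
    index=pd.date_range("2021-01-01", periods=size, freq="h"),
)

half_width = 3  # hours on each side of the bin center
width = 6       # bin width in hours

# Shift by half a bin, wrap at 24, then integer-divide by the width:
# hours 21..23 and 0..2 -> 0, 3..8 -> 1, 9..14 -> 2, 15..20 -> 3
bin_id = ((df.index.hour.to_numpy() + half_width) % 24) // width

labels = np.array(["21:00-02:59", "03:00-08:59", "09:00-14:59", "15:00-20:59"])
df["time_bin"] = pd.Categorical(labels[bin_id], categories=labels, ordered=True)

result = df.groupby("time_bin", observed=False)["value"].mean()
print(result)
```

Using an ordered categorical keeps the bins in chronological order in the output rather than sorting the labels alphabetically.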

Side note for xarray users: the same approach also works with xarray's groupby_bins method.

# Where `ds` is an xarray.Dataset with a `time` coordinate.
HOURS = ds.time.dt.hour
HOURS[HOURS>=21] -= 24
bins = [-3, 3, 9, 15, 21]
ds.groupby_bins(HOURS, bins, right=False).mean()

Upvotes: 0
