thedude

Reputation: 53

Python: Grouping values of different columns into time buckets

Say you have this DataFrame:

Name    Item    Date    value1  value2
Marc    bike    21-Dec-17   7   1000
Marc    bike    05-Jan-18   9   2000
Marc    bike    27-Jul-18   4   500
John    house   14-Dec-17   4   500
John    house   02-Feb-18   6   500
John    house   07-Feb-18   8   1000
John    house   16-Feb-18   2   1000
John    house   05-Dec-21   7   1000
John    house   27-Aug-25   8   500
John    car     17-Apr-18   4   500

I would like to bin value1 and value2 into monthly buckets (each 3rd Wednesday for the next 48 months) for each Name/Item combination.

So there are 49 time buckets for each combination with the sum of value1 and value2 for each month: Marc/bike, John/house, John/car, ...
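For a runnable setup, the sample frame above can be built like this (assuming Date is parsed with the day-month-year format shown in the table):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Marc', 'Marc', 'Marc', 'John', 'John',
             'John', 'John', 'John', 'John', 'John'],
    'Item': ['bike', 'bike', 'bike', 'house', 'house',
             'house', 'house', 'house', 'house', 'car'],
    'Date': pd.to_datetime(
        ['21-Dec-17', '05-Jan-18', '27-Jul-18', '14-Dec-17', '02-Feb-18',
         '07-Feb-18', '16-Feb-18', '05-Dec-21', '27-Aug-25', '17-Apr-18'],
        format='%d-%b-%y'),
    'value1': [7, 9, 4, 4, 6, 8, 2, 7, 8, 4],
    'value2': [1000, 2000, 500, 500, 500, 1000, 1000, 1000, 500, 500],
})
```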

The solution for John/house would look like:

Name    Item    TimeBucket  value1  value2
John    house   20-Dec-17   4   500
John    house   17-Jan-18   0   0
John    house   21-Feb-18   16  2500
John    house   21-Mar-18   0   0
John    house   18-Apr-18   0   0
John    house   …           0   0
John    house   17-Nov-21   0   0
John    house   15-Dec-21   7   1000
John    house   rest        8   500

I can't get to a result with pandas. The only solution I can think of is a row-by-row iteration through the DataFrame, but I'd really like to avoid that. Is there an elegant way to do it?

Upvotes: 3

Views: 2416

Answers (1)

Cornflex

Reputation: 688

The question really boils down to three steps:

1. How to find the third Wednesday of every month?

This may not be the most elegant solution, but you can filter out the third Wednesday of every month by masking a pandas DatetimeIndex which contains every day in the time frame.

# generate a DatetimeIndex for all days in the relevant time frame
import itertools
from datetime import datetime

import pandas as pd

start = datetime(2017, 12, 1)
end = datetime(2022, 1, 31)
days = pd.date_range(start, end, freq='D')

# keep only the third Wednesday of each month (weekday 2 == Wednesday)
third_wednesdays = []
for year, month in itertools.product(range(2017, 2023), range(1, 13)):
    mask = (days.weekday == 2) & \
        (days.year == year) & \
        (days.month == month)
    if len(days[mask]) >= 3:  # skips months that fall outside (or are cut off by) the range
        third_wednesdays.append(days[mask][2])
bucket_lower_bounds = pd.DatetimeIndex(third_wednesdays)

Convert the resulting list to a DatetimeIndex so you can use it as the lower bounds of the bins in step 2.
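As an aside, pandas also ships a "week of month" anchored offset that generates these dates directly; a shorter sketch, assuming your pandas version supports the WOM-3WED frequency alias:

```python
import pandas as pd

# 'WOM-3WED' means "the third Wednesday of each month"
bucket_lower_bounds = pd.date_range('2017-12-01', '2022-01-31', freq='WOM-3WED')
```

This yields one timestamp per month in the range, each falling on a Wednesday.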

2. How to bin values of a DataFrame?

Then, once you have the list of buckets as a DatetimeIndex, you can simply use pandas' cut function to assign each date to a bucket. Convert the date columns to integers before passing them into cut, and convert the result back to dates afterwards:

time_buckets = pd.to_datetime(
    pd.cut(
        x = pd.to_numeric(df['Date']), 
        bins = pd.to_numeric(bucket_lower_bounds), 
        labels = bucket_lower_bounds[:-1]
    )
)
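On recent pandas versions, cut also accepts datetime values and bins directly, so the numeric round-trip can be skipped; a minimal sketch with made-up mini data:

```python
import pandas as pd

# hypothetical mini example: three dates and three bucket lower bounds
df = pd.DataFrame({'Date': pd.to_datetime(['2017-12-21', '2018-01-05', '2018-02-02'])})
bucket_lower_bounds = pd.to_datetime(['2017-12-20', '2018-01-17', '2018-02-21'])

# cut assigns each date to a half-open interval (lower, upper],
# labelled here with the interval's lower bound
time_buckets = pd.cut(
    df['Date'],
    bins=bucket_lower_bounds,
    labels=bucket_lower_bounds[:-1],
)
```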

The series time_buckets assigns each index value of your original data frame to a bucket's lower bound. We can simply add it to the original data frame now:

df['TimeBucket'] = time_buckets

The result should look somewhat like this (note that NaT stands for the "rest" bucket):

    Name    Item    Date    value1  value2  TimeBucket
0   Marc    bike    2017-12-21  7   1000    2017-12-20
1   Marc    bike    2018-01-05  9   2000    2017-12-20
2   Marc    bike    2018-07-27  4   500     2018-07-18
3   John    house   2017-12-14  4   500     NaT
4   John    house   2018-02-02  6   500     2018-01-17
5   John    house   2018-02-07  8   1000    2018-01-17
6   John    house   2018-02-16  2   1000    2018-01-17
7   John    house   2021-12-05  7   1000    2021-11-17
8   John    house   2025-08-27  8   500     NaT
9   John    car     2018-04-17  4   500     2018-03-21

3. How to aggregate a binned DataFrame?

Now it's as simple as using groupby to get the sums for each combination of name, item and bucket:

df.groupby(['Name','Item','TimeBucket']).sum()

Result:

Name    Item    TimeBucket  value1  value2
John    car     2018-03-21  4       500
        house   2018-01-17  16      2500
                2021-11-17  7       1000
Marc    bike    2017-12-20  16      3000
                2018-07-18  4       500

Unfortunately, groupby excludes NaT keys by default. If you need to sum those rows as well, perhaps the easiest fix is to make sure that your list of buckets covers every date in your input range.
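Alternatively, on pandas >= 1.1 (an assumption about your environment), groupby takes a dropna=False argument that keeps the NaT group, so the "rest" bucket survives the aggregation; a minimal sketch with made-up numbers:

```python
import pandas as pd

# made-up aggregate rows: one regular bucket and one NaT ("rest") bucket
df = pd.DataFrame({
    'Name': ['John', 'John'],
    'Item': ['house', 'house'],
    'TimeBucket': [pd.Timestamp('2018-01-17'), pd.NaT],
    'value1': [16, 8],
})

# dropna=False keeps NaT as its own group instead of silently dropping it
out = df.groupby(['Name', 'Item', 'TimeBucket'], dropna=False)['value1'].sum()
```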

Edit: Step 2 requires pandas version >= 0.18.1.

Upvotes: 2
