Reputation: 53
Say you have this DataFrame:
Name Item Date value1 value2
Marc bike 21-Dec-17 7 1000
Marc bike 05-Jan-18 9 2000
Marc bike 27-Jul-18 4 500
John house 14-Dec-17 4 500
John house 02-Feb-18 6 500
John house 07-Feb-18 8 1000
John house 16-Feb-18 2 1000
John house 05-Dec-21 7 1000
John house 27-Aug-25 8 500
John car 17-Apr-18 4 500
I would like to bin value1 and value2 into monthly buckets (each 3rd wednesday for the next 48 months) for each name-item-combination.
So there are 49 time buckets for each combination with the sum of value1 and value2 for each month: Marc/bike, John/house, John/car, ...
The solution for John/house would look like:
Name Item TimeBucket value1 value2
John house 20-Dec-17 4 500
John house 17-Jan-18 0 0
John house 21-Feb-18 16 2500
John house 21-Mar-18 0 0
John house 18-Apr-18 0 0
John house … 0 0
John house 17-Nov-21 0 0
John house 15-Dec-21 7 1000
John house rest 8 500
i cant get to a result with pandas. The only solution I can think of is a row-by-row iteration through the dataframe, but I'd really like to avoid having to do this. Is there an elegant way to do it?
Upvotes: 3
Views: 2416
Reputation: 688
The question really boils down to three steps:
This may not be the most elegant solution, but you can filter out the third Wednesday of every month by masking a pandas DatetimeIndex
which contains every day in the time frame.
# generate a DatetimeIndex for all days in the relevant time frame
from datetime import datetime
start = datetime(2017, 12, 1)
end = datetime(2022, 1, 31)
days = pd.date_range(start, end, freq='D')
# filter out only the third wednesday of each month
import itertools
third_wednesdays = []
for year, month in itertools.product(range(2017, 2023), range(1,13)):
mask = (days.weekday == 2) & \
(days.year == year) & \
(days.month == month)
if len(days[mask]) > 0:
third_wednesdays.append(days[mask][2])
bucket_lower_bounds = pd.DatetimeIndex(third_wednesdays)
Convert the resulting list to a DatetimeIndex
so you can use it as the lower bounds of the bins in step 2.
Then, once you have the list of buckets as a DatetimeIndex
, you can simply use panda's cut function to assign each date to a bucket. Convert date columns to integers before passing them into cut
, and then convert the result back to dates:
time_buckets = pd.to_datetime(
pd.cut(
x = pd.to_numeric(df['Date']),
bins = pd.to_numeric(bucket_lower_bounds),
labels = bucket_lower_bounds[:-1]
)
)
The series time_buckets
assigns each index value of your original data frame to a bucket's lower bound. We can simply add it to the original data frame now:
df['TimeBucket'] = time_buckets
The result should look somewhat like this (not that NaT
stands for the "rest" bucket):
Name Item Date value1 value2 TimeBucket
0 Marc bike 2017-12-21 7 1000 2017-12-20
1 Marc bike 2018-01-05 9 2000 2017-12-20
2 Marc bike 2018-07-27 4 500 2018-07-18
3 John house 2017-12-14 4 500 NaT
4 John house 2018-02-02 6 500 2018-01-17
5 John house 2018-02-07 8 1000 2018-01-17
6 John house 2018-02-16 2 1000 2018-01-17
7 John house 2021-12-05 7 1000 2021-11-17
8 John house 2025-08-27 8 500 NaT
9 John car 2018-04-17 4 500 2018-03-21
Now it's as simple as using groupby
to get the sums for each combination of name, item and bucket:
df.groupby(['Name','Item','TimeBucket']).sum()
Result:
Name Item TimeBucket value1 value2
John car 2018-03-21 4 500
house 2018-01-17 16 2500
2021-11-17 7 1000
Marc bike 2017-12-20 16 3000
2018-07-18 4 500
Unfortunately, NaT values are excluded from groupby. If you need to sum those as well, perhaps it would be easiest to make sure that your list of buckets has at least one bucket for every date in your input range.
Edit: Step 2 requires pandas version >= 0.18.1.
Upvotes: 2