Reputation: 33
I am struggling to find the correct way to group a DataFrame under some constraints. I have the following DataFrame:
start_dt machine benchmark value1 value2 value3
2021-06-07 07:32:01 A bench1 0 0 0
2021-06-07 07:32:37 A bench1 0 0 0
2021-06-07 07:33:13 A bench1 0 0 0
2021-06-07 07:33:49 A bench1 0 0 0
2021-06-07 07:34:26 A bench1 0 0 0
2021-06-07 08:30:26 A bench1 0 0 10
2021-06-07 11:12:21 A bench1 0 0 6
2021-06-07 12:05:21 A bench1 1 0 10
2021-06-17 12:28:57 A bench2 0 0 0
2021-06-17 12:29:29 A bench2 0 0 0
2021-06-17 12:33:09 A bench2 3 0 1
2021-06-17 12:33:48 A bench2 3 0 1
2021-06-17 12:35:17 A bench2 0 0 0
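For anyone who wants to reproduce this, the frame can be rebuilt with a snippet like the following (column dtypes are assumed from the printout; start_dt is parsed with pd.to_datetime):

```python
import pandas as pd

# Reconstruction of the sample frame shown above
df = pd.DataFrame({
    "start_dt": pd.to_datetime([
        "2021-06-07 07:32:01", "2021-06-07 07:32:37", "2021-06-07 07:33:13",
        "2021-06-07 07:33:49", "2021-06-07 07:34:26", "2021-06-07 08:30:26",
        "2021-06-07 11:12:21", "2021-06-07 12:05:21", "2021-06-17 12:28:57",
        "2021-06-17 12:29:29", "2021-06-17 12:33:09", "2021-06-17 12:33:48",
        "2021-06-17 12:35:17",
    ]),
    "machine": "A",  # scalar broadcasts to every row
    "benchmark": ["bench1"] * 8 + ["bench2"] * 5,
    "value1": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 3, 0],
    "value2": 0,
    "value3": [0, 0, 0, 0, 0, 10, 6, 10, 0, 0, 1, 1, 0],
})
```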
I want to group based on the machine, benchmark, and start_dt columns. However, there are some constraints on the start_dt column: the start_dt grouping must be in 1-hour chunks. I have tried the following command:
df.groupby(["machine", "benchmark", pd.Grouper(key="start_dt", freq="1h", sort=True, origin="start")]).sum()
However, it will group the dataframe based on the first datetime for all benchmarks, and I don't want this. What I would like is something like the following, where end_dt is start_dt + 1h.
machine benchmark start_dt end_dt value1 value2 value3
A bench1 2021-06-07 07:32:01 2021-06-07 08:32:01 0 0 10
2021-06-07 11:12:21 2021-06-07 12:12:21 1 0 16
bench2 2021-06-17 12:28:57 2021-06-17 13:28:57 6 0 2
For example, for machine A and benchmark bench1 there are at least two time intervals
2021-06-07 07:32:01 2021-06-07 08:32:01
2021-06-07 11:12:21 2021-06-07 12:12:21
but nothing in between; consequently, I would like to keep the time intervals as they appear in the column instead of what pandas Grouper gives me. Is it possible?
Upvotes: 3
Views: 383
Reputation: 9019
Yes, this is possible; you'll just need to create a custom grouping function to handle the non-uniformity of your use case. In the solution below, I am first creating a new column end_dt, which we will later use as our innermost grouping index. To create this column, we invoke a function get_end_times() on the start_dt column, which takes each group (machine/benchmark combo) and calls the run_calc() inner function. That function uses the first start_dt in the slice of the dataframe passed to it to determine where to set the endpoint (1 hour later). It then checks which elements fall within that span and returns the set of end_dt values that will be reassigned to the group that invoked it. This iterates until all start_dt values have been assigned an end_dt value (checked via (~f).all()). See below for the full implementation:
import numpy as np
import pandas as pd

def run_calc(x):
    # True where a start time falls outside the 1-hour window opened by
    # the first start time in this slice
    i = (x - x.iloc[0]).dt.total_seconds() > 3600
    # everything inside the window gets the same end time: first start + 1 hour
    x[~i] = x.iloc[0] + np.timedelta64(1, 'h')
    return x, i

def get_end_times(group):
    f = pd.Series([True] * len(group), index=group.index)
    iterate = True
    while iterate:
        new, i = run_calc(group[f])
        group[new.index] = new
        # realign the mask to the full group index so the next group[f]
        # lookup still works when more than two windows are needed
        f = i.reindex(group.index, fill_value=False)
        if (~f).all(): iterate = False
    return group

df['end_dt'] = df.groupby(['machine','benchmark'])['start_dt'].transform(get_end_times)
df.groupby(['machine','benchmark','end_dt']).agg({'start_dt': 'first', 'value1': 'sum', 'value2': 'sum', 'value3': 'sum'}) \
    .reset_index().set_index(['machine','benchmark','start_dt','end_dt'])
Yields:
                                                          value1  value2  value3
machine benchmark start_dt            end_dt
A       bench1    2021-06-07 07:32:01 2021-06-07 08:32:01      0       0      10
                  2021-06-07 11:12:21 2021-06-07 12:12:21      1       0      16
        bench2    2021-06-17 12:28:57 2021-06-17 13:28:57      6       0       2
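For completeness, here is a self-contained sketch of the whole pipeline (sample data rebuilt from the question, the iterative window assignment, then the aggregation) that confirms the sums shown above:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "start_dt": pd.to_datetime([
        "2021-06-07 07:32:01", "2021-06-07 07:32:37", "2021-06-07 07:33:13",
        "2021-06-07 07:33:49", "2021-06-07 07:34:26", "2021-06-07 08:30:26",
        "2021-06-07 11:12:21", "2021-06-07 12:05:21", "2021-06-17 12:28:57",
        "2021-06-17 12:29:29", "2021-06-17 12:33:09", "2021-06-17 12:33:48",
        "2021-06-17 12:35:17",
    ]),
    "machine": "A",
    "benchmark": ["bench1"] * 8 + ["bench2"] * 5,
    "value1": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 3, 0],
    "value2": 0,
    "value3": [0, 0, 0, 0, 0, 10, 6, 10, 0, 0, 1, 1, 0],
})

def run_calc(x):
    # True where a start time lies more than 1 hour after the slice's first start
    outside = (x - x.iloc[0]).dt.total_seconds() > 3600
    # rows inside the window all get end time = first start + 1 hour
    x[~outside] = x.iloc[0] + np.timedelta64(1, "h")
    return x, outside

def get_end_times(group):
    remaining = pd.Series(True, index=group.index)
    while remaining.any():
        new, outside = run_calc(group[remaining])
        group[new.index] = new
        # keep iterating on rows that fell outside the current window
        remaining = outside.reindex(group.index, fill_value=False)
    return group

df["end_dt"] = df.groupby(["machine", "benchmark"])["start_dt"].transform(get_end_times)
res = (df.groupby(["machine", "benchmark", "end_dt"])
         .agg({"start_dt": "first", "value1": "sum", "value2": "sum", "value3": "sum"})
         .reset_index())
print(res[["machine", "benchmark", "start_dt", "end_dt", "value1", "value2", "value3"]])
```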
Upvotes: 1