Pandas groupby variable time intervals

I am struggling to find the correct way to group a DataFrame using some constraints. I have the following Dataframe:

           start_dt  machine     benchmark      value1  value2  value3
2021-06-07 07:32:01  A           bench1         0       0       0
2021-06-07 07:32:37  A           bench1         0       0       0
2021-06-07 07:33:13  A           bench1         0       0       0
2021-06-07 07:33:49  A           bench1         0       0       0
2021-06-07 07:34:26  A           bench1         0       0       0
2021-06-07 08:30:26  A           bench1         0       0       10
2021-06-07 11:12:21  A           bench1         0       0       6
2021-06-07 12:05:21  A           bench1         1       0       10
2021-06-17 12:28:57  A           bench2         0       0       0
2021-06-17 12:29:29  A           bench2         0       0       0
2021-06-17 12:33:09  A           bench2         3       0       1
2021-06-17 12:33:48  A           bench2         3       0       1
2021-06-17 12:35:17  A           bench2         0       0       0

I want to group base on machine, benchmark, and start_dt columns. However, it has some constraints on the start_dt column. start_dt group criteria must be on 1h chunks. I have tried the following command:

df.groupby(["machine", "benchmark", pd.Grouper(key="start_dt", freq="1h", sort=True, origin="start")]).sum()

However, it will group the dataframe based on the first datetime on for all benchmaks, and I don't want this. What I would like is something like the following, where end_dt is start_dt + 1h.

machine benchmark          start_dt               end_dt   value1  value2  value3              
A       bench1  2021-06-07 07:32:01  2021-06-07 08:32:01   0          0    10
                2021-06-07 11:12:21  2021-06-07 12:12:21   1          0    16
        bench2  2021-06-17 12:28:57  2021-06-17 13:28:57   6          0    2

For example for machine A and benchmark bench1 there are at least two time intervals

2021-06-07 07:32:01 2021-06-07 08:32:01

2021-06-07 11:12:21 2021-06-07 12:12:21

but nothing in the middle, consequentially I would like to maintain the time intervals as they appear on the column instead of what pandas Grouper gives to me. Is it possible?

Edits:

Upvotes: 3

Views: 383

Answers (1)

rahlf23
rahlf23

Reputation: 9019

Yes, this is possible, you'll just need to create a custom grouping function in order to handle the non-uniformity of your use case. In the solution below, I am first creating a new column end_dt, which we will later use as our innermost grouping index. In order to create this column, we are invoking a function get_end_times() using the start_dt column that will take each group (machine/benchmark combo) and call the run_calc() inner function. This function uses the first start_dt in the slice of the dataframe passed to the function to determine where to set the endpoint (1-hr later). It then checks which elements fall within that span and returns the set of end_dt that will be reassigned to the group that invoked the inner function. This iterates until all start_dt values have been assigned an end_dt value (checked via (~f).all()). See below for full implementation:

def run_calc(x):

    i = (x - x.iloc[0]).dt.total_seconds()>3600

    x[~i] = x.iloc[0] + np.timedelta64(1, 'h')

    return x, i

def get_end_times(group):

    f = pd.Series([True]*len(group), index=group.index)

    iterate = True

    while iterate:
        new, f = run_calc(group[f])
        group[(~f).index] = new
        if (~f).all(): iterate = False

    return group

df['end_dt'] = df.groupby(['machine','benchmark'])['start_dt'].transform(get_end_times)

df.groupby(['machine','benchmark','end_dt']).agg({'start_dt': 'first', 'value1': 'sum', 'value2': 'sum', 'value3': 'sum'}) \
    .reset_index().set_index(['machine','benchmark','start_dt','end_dt'])

Yields:

                                                           value1  value2  \
machine benchmark start_dt            end_dt                                
A       bench1    2021-06-07 07:32:01 2021-06-07 08:32:01       0       0   
                  2021-06-07 11:12:21 2021-06-07 12:12:21       1       0   
        bench2    2021-06-17 12:28:57 2021-06-17 13:28:57       6       0   

                                                           value3  
machine benchmark start_dt            end_dt                       
A       bench1    2021-06-07 07:32:01 2021-06-07 08:32:01      10  
                  2021-06-07 11:12:21 2021-06-07 12:12:21      16  
        bench2    2021-06-17 12:28:57 2021-06-17 13:28:57       2  

Upvotes: 1

Related Questions