qwerty

Reputation: 887

Aggregate rows with same values in specific variables, by ID and Date - Pandas

My data frame is indexed and sorted by ID, date and time, and also includes the variables X1, X2 and Y. I want to aggregate rows that have the same values in X1 & X2 and are adjacent to each other within every (ID, date) index. The aggregation should sum the Y variable, and I want to keep the row that "happened" earlier (by time).

Data illustration:

ID  date        time    X1  X2  Y
A   07/27/2019  10:00   2   3   100
                12:00   2   3   200
                14:00   2   4   50
    08/27/2019  11:00   2   4   10
                12:00   2   3   20
                13:00   2   3   30
                14:00   2   4   40
B   06/28/2019  16:00   2   4   100
                17:00   2   3   80
                18:00   2   3   80

Expected result:

ID  date        time    X1  X2  Y
A   07/27/2019  10:00   2   3   300
                14:00   2   4   50
    08/27/2019  11:00   2   4   10
                12:00   2   3   50
                14:00   2   4   40
B   06/28/2019  16:00   2   4   100
                17:00   2   3   160
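For reference, a minimal sketch that reconstructs the sample frame above (dtypes are assumed: dates and times are kept as plain strings here):

```python
import pandas as pd

# Rebuild the question's sample data with a (ID, date, time) MultiIndex
df = pd.DataFrame({
    'ID':   ['A'] * 7 + ['B'] * 3,
    'date': ['07/27/2019'] * 3 + ['08/27/2019'] * 4 + ['06/28/2019'] * 3,
    'time': ['10:00', '12:00', '14:00', '11:00', '12:00', '13:00', '14:00',
             '16:00', '17:00', '18:00'],
    'X1':   [2] * 10,
    'X2':   [3, 3, 4, 4, 3, 3, 4, 4, 3, 3],
    'Y':    [100, 200, 50, 10, 20, 30, 40, 100, 80, 80],
}).set_index(['ID', 'date', 'time'])
print(df)
```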

Upvotes: 0

Views: 687

Answers (2)

jezrael

Reputation: 863166

I think the input DataFrame has a MultiIndex:

print (df.index)
MultiIndex([('A', '07/27/2019', '10:00'),
            ('A', '07/27/2019', '12:00'),
            ('A', '07/27/2019', '14:00'),
            ('A', '08/27/2019', '11:00'),
            ('A', '08/27/2019', '12:00'),
            ('A', '08/27/2019', '13:00'),
            ('A', '08/27/2019', '14:00'),
            ('B', '06/28/2019', '16:00'),
            ('B', '06/28/2019', '17:00'),
            ('B', '06/28/2019', '18:00')],
           names=['ID', 'date', 'time'])

So first use reset_index:

df = df.reset_index()

Then create unique group labels with shift and cumsum, aggregate, and set the MultiIndex again:

g = df[['X1', 'X2']].ne(df[['X1', 'X2']].shift()).any(axis=1).cumsum()
df = (df.groupby(['ID', 'date', 'X1', 'X2', g], sort=False)
       .agg({'time':'first', 'Y':'sum'})
       .set_index('time', append=True)
       .reset_index(level=4, drop=True)
       .reset_index(['X1','X2']))
print (df)

                     X1  X2    Y
ID date       time              
A  07/27/2019 10:00   2   3  300
              14:00   2   4   50
   08/27/2019 11:00   2   4   10
              12:00   2   3   50
              14:00   2   4   40
B  06/28/2019 16:00   2   4  100
              17:00   2   3  160
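To see what the shift/cumsum trick produces on its own, here is a minimal sketch using just the X1/X2 values from the question's sample data:

```python
import pandas as pd

# (X1, X2) pairs from the question, in row order
x = pd.DataFrame({'X1': [2] * 10,
                  'X2': [3, 3, 4, 4, 3, 3, 4, 4, 3, 3]})

# A new group starts whenever either column differs from the previous row;
# cumsum turns those break points into consecutive group labels.
g = x.ne(x.shift()).any(axis=1).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Labels 2 and 4 straddle date/ID boundaries in the sample data, but that is harmless because ID and date are also part of the groupby keys.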

Upvotes: 1

Erfan

Reputation: 42916

First we check whether each row is a consecutive duplicate with diff: if the difference to the next row is 0 for both X1 and X2, the rows are consecutive duplicates and we need to sum them.

Then we group by ID, date, X1, X2 and that key, taking the first time and the sum of Y:

s = df[['X1', 'X2']].diff(-1).eq(0).all(axis=1).cumsum()

grpd = (df.groupby(['ID', 'date', 'X1', 'X2', s], sort=False)
          .agg({'time': 'first', 'Y': 'sum'})
          .reset_index(level=[2, 3]))

Output

                 X1  X2   time    Y
ID date                            
A  07/27/2019 1   2   3  10:00  300
              2   2   4  14:00   50
   08/27/2019 2   2   4  11:00   10
              3   2   3  12:00   50
              4   2   4  14:00   40
B  06/28/2019 4   2   4  16:00  100
              5   2   3  17:00  160

Upvotes: 1
