qwerty

Reputation: 887

Aggregate rows with same values in specific variables, by ID and Date - Pandas

My data frame is indexed and sorted by ID, date and time, and also includes the variables X1, X2 and Y. I want to aggregate rows that have the same values in X1 & X2 and are adjacent to each other within every (ID, date) index. The aggregation should sum the Y variable, and I want to keep the row that "happened" earlier (by time).

Data illustration:

ID  date        time    X1  X2  Y
A   07/27/2019  10:00   2   3   100
                12:00   2   3   200
                14:00   2   4   50
    08/27/2019  11:00   2   4   10
                12:00   2   3   20
                13:00   2   3   30
                14:00   2   4   40
B   06/28/2019  16:00   2   4   100
                17:00   2   3   80
                18:00   2   3   80

Expected result:

ID  date        time    X1  X2  Y
A   07/27/2019  10:00   2   3   300
                14:00   2   4   50
    08/27/2019  11:00   2   4   10
                12:00   2   3   50
                14:00   2   4   40
B   06/28/2019  16:00   2   4   100
                17:00   2   3   160
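For reference, a minimal sketch that reconstructs the sample frame above (dtypes are assumed: dates and times are kept as plain strings here):

```python
import pandas as pd

# Rebuild the question's sample data with a (ID, date, time) MultiIndex
df = pd.DataFrame({
    'ID':   ['A'] * 7 + ['B'] * 3,
    'date': ['07/27/2019'] * 3 + ['08/27/2019'] * 4 + ['06/28/2019'] * 3,
    'time': ['10:00', '12:00', '14:00', '11:00', '12:00', '13:00', '14:00',
             '16:00', '17:00', '18:00'],
    'X1':   [2] * 10,
    'X2':   [3, 3, 4, 4, 3, 3, 4, 4, 3, 3],
    'Y':    [100, 200, 50, 10, 20, 30, 40, 100, 80, 80],
}).set_index(['ID', 'date', 'time'])
print(df)
```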

Upvotes: 0

Views: 687

Answers (2)

jezrael

Reputation: 863166

I think the input DataFrame has a MultiIndex:

print (df.index)
MultiIndex([('A', '07/27/2019', '10:00'),
            ('A', '07/27/2019', '12:00'),
            ('A', '07/27/2019', '14:00'),
            ('A', '08/27/2019', '11:00'),
            ('A', '08/27/2019', '12:00'),
            ('A', '08/27/2019', '13:00'),
            ('A', '08/27/2019', '14:00'),
            ('B', '06/28/2019', '16:00'),
            ('B', '06/28/2019', '17:00'),
            ('B', '06/28/2019', '18:00')],
           names=['ID', 'date', 'time'])

So first use reset_index:

df = df.reset_index()

Then create unique group labels with shift and cumsum, aggregate, and set the MultiIndex again:

g = df[['X1', 'X2']].ne(df[['X1', 'X2']].shift()).any(axis=1).cumsum()
df = (df.groupby(['ID', 'date', 'X1', 'X2', g], sort=False)
       .agg({'time':'first', 'Y':'sum'})
       .set_index('time', append=True)
       .reset_index(level=4, drop=True)
       .reset_index(['X1','X2']))
print (df)

                     X1  X2    Y
ID date       time              
A  07/27/2019 10:00   2   3  300
              14:00   2   4   50
   08/27/2019 11:00   2   4   10
              12:00   2   3   50
              14:00   2   4   40
B  06/28/2019 16:00   2   4  100
              17:00   2   3  160
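To see what the shift/cumsum trick produces on its own, here is a minimal sketch using just the X1/X2 values from the question's sample data:

```python
import pandas as pd

# (X1, X2) pairs from the question, in row order
x = pd.DataFrame({'X1': [2] * 10,
                  'X2': [3, 3, 4, 4, 3, 3, 4, 4, 3, 3]})

# A new group starts whenever either column differs from the previous row;
# cumsum turns those break points into consecutive group labels.
g = x.ne(x.shift()).any(axis=1).cumsum()
print(g.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Labels 2 and 4 straddle date/ID boundaries in the sample data, but that is harmless because ID and date are also part of the groupby keys.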

Upvotes: 1

Erfan

Reputation: 42916

First we check whether each row is a consecutive duplicate with diff: if the difference to the next row is 0 for both X1 and X2, the rows are consecutive duplicates and we need to sum them.

Then we group by ID, date, X1, X2 and that key, taking the first time and the sum of Y:

s = df[['X1', 'X2']].diff(-1).eq(0).all(axis=1).cumsum()

grpd = (df.groupby(['ID', 'date', 'X1', 'X2', s], sort=False)
          .agg({'time': 'first', 'Y': 'sum'})
          .reset_index(level=[2, 3]))

Output

                 X1  X2   time    Y
ID date                            
A  07/27/2019 1   2   3  10:00  300
              2   2   4  14:00   50
   08/27/2019 2   2   4  11:00   10
              3   2   3  12:00   50
              4   2   4  14:00   40
B  06/28/2019 4   2   4  16:00  100
              5   2   3  17:00  160

Upvotes: 1
