Reputation: 887
My data frame indexed and sorted by ID, date and time, and also includes X1, X2 and Y variables. I want to aggregate rows with same values in X1 & X2 that adjacent to each other in every (ID, date) index. The aggregation will be a sum of Y variable, and I want to stay with the row that "happened" earlier (by time).
Data illustration:
ID date time X1 X2 Y
A 07/27/2019 10:00 2 3 100
12:00 2 3 200
14:00 2 4 50
08/27/2019 11:00 2 4 10
12:00 2 3 20
13:00 2 3 30
14:00 2 4 40
B 06/28/2019 16:00 2 4 100
17:00 2 3 80
18:00 2 3 80
Expected result:
ID date time X1 X2 Y
A 07/27/2019 10:00 2 3 300
14:00 2 4 50
08/27/2019 11:00 2 4 10
12:00 2 3 50
14:00 2 4 40
B 06/28/2019 16:00 2 4 100
17:00 2 3 160
Upvotes: 0
Views: 687
Reputation: 863166
I think there is MultiIndex
in input values:
print (df.index)
MultiIndex([('A', '07/27/2019', '10:00'),
('A', '07/27/2019', '12:00'),
('A', '07/27/2019', '14:00'),
('A', '08/27/2019', '11:00'),
('A', '08/27/2019', '12:00'),
('A', '08/27/2019', '13:00'),
('A', '08/27/2019', '14:00'),
('B', '06/28/2019', '16:00'),
('B', '06/28/2019', '17:00'),
('B', '06/28/2019', '18:00')],
names=['ID', 'date', 'time'])
So first use reset_index
:
df = df.reset_index()
Then create unique groups by shift
and cumsum
, aggregate and again set MultiIndex
:
g = df[['X1', 'X2']].ne(df[['X1', 'X2']].shift()).any(1).cumsum()
df = (df.groupby(['ID', 'date', 'X1', 'X2', g], sort=False)
.agg({'time':'first', 'Y':'sum'})
.set_index('time', append=True)
.reset_index(level=4, drop=True)
.reset_index(['X1','X2']))
print (df)
X1 X2 Y
ID date time
A 07/27/2019 10:00 2 3 300
14:00 2 4 50
08/27/2019 11:00 2 4 10
12:00 2 3 50
14:00 2 4 40
B 06/28/2019 16:00 2 4 100
17:00 2 3 160
Upvotes: 1
Reputation: 42916
First we check if it's a consecutive duplicate with diff
and if that difference is equal to 0
for both X1
and X2
it's a consecutive duplicate and we need to sum them.
Then we use GroupBy
on ID, date, X1, X2
and sum
of Y
:
s = df[['X1', 'X2']].diff(-1).eq(0).all(axis=1).cumsum()
grpd = df.groupby(['ID', 'date', 'X1', 'X2', s], sort=False).agg({'time':'first',
'Y':'sum'}).reset_index(level=[2,3])
Output
X1 X2 time Y
ID date
A 07/27/2019 1 2 3 10:00 300
2 2 4 14:00 50
08/27/2019 2 2 4 11:00 10
3 2 3 12:00 50
4 2 4 14:00 40
B 06/28/2019 4 2 4 16:00 100
5 2 3 17:00 160
Upvotes: 1