Reputation: 141
I have a DataFrame with two columns that will be used in group by( GroupBy1 and GroupBy2 ), with dozens of columns that will be measures using agg() ( MesA:Max, MesB:Min, MesC:sum.. ) and other columns that aren't for measures, but for informations like datetime of last row in the groupby, 'GroupName1', 'GroupName2', GroupId of another database, and others.
TicketsDBFrame
GroupBy1 GroupBy2 GroupName1 GroupName2 MesA MesB MesC MesD LastTicketTime GroupId1 GroupId2
1 1 First First 2 3 1 6 2021-04-05 01:00:00 4 99
1 1 First First 4 1 3 2 2021-04-05 02:00:00 4 99
1 1 First First 2 5 2 1 2021-04-05 03:00:00 4 99
1 2 First Second 2 5 2 1 2021-04-05 01:30:00 4 75
1 2 First Second 1 4 7 3 2021-04-05 02:30:00 4 75
2 2 Second Second 4 2 1 8 2021-04-05 02:00:00 2 75
2 2 Second Second 1 6 3 1 2021-04-05 04:00:00 2 75
The desired output:
GroupBy1 GroupBy2 GroupName1 GroupName2 MesA MesB MesC MesD LastTicketTime GroupId1 GroupId2
1 1 First First 4 1 6 9 2021-04-05 03:00:00 4 99
1 2 First Second 2 4 4 2 2021-04-05 02:30:00 4 75
2 2 Second Second 4 2 4 9 2021-04-05 04:00:00 2 75
I already know how to create this desired Frame using derived DataFrames, using 'loc' and 'idxmax' to get LastTicketTime in a frame, other derived frame to 'Ids and Names' and another DataFrame to call agg() for the measure columns, after that I do a merge in the frames
groupInfoFrame:Little Derived Frame created associating GroupBy1 - GroupName1 - GroupId1
lastTicketFrame: Frame with only LastTicketTime
lastTicketFrame=TicketsDBFrame[['GroupBy1','GroupBy2','LastTicketTime' ]]
lastTicketFrame=lastTicketFrame.loc[lastTicketFrame.groupby(['GroupBy1'],['GroupBy2]).LastTicketTime.idxmax() ]
measuresFrame: Only measures
measuresFrame = TicketsDBFrame.groupby(['GroupBy1'],['GroupBy2]).agg( mesA:.....MesD )
After all I do a merge in the measureFrame and lastTicketFrame using the GroupBy1 and GroupBy2 as keys
Is it possible to have all these information in only one agg() or transform() or other function call? Without derived frames and merges
Upvotes: 0
Views: 40
Reputation: 9941
You can do it in a single agg
(by sorting values by LastTicketTime
and taking last
in agg
):
(df
.sort_values('LastTicketTime')
.groupby(['GroupBy1', 'GroupBy2'], as_index=False)
.agg({
'GroupName1': 'last',
'GroupName2': 'last',
'MesA': 'max',
'MesB': 'min',
'MesC': 'sum',
'MesD': 'sum',
'LastTicketTime': 'last',
'GroupId1': 'last',
'GroupId2': 'last'
}))
Output:
GroupBy1 GroupBy2 GroupName1 GroupName2 MesA MesB MesC MesD \
0 1 1 First First 4 1 6 9
1 1 2 First Second 2 4 9 4
2 2 2 Second Second 4 2 4 9
LastTicketTime GroupId1 GroupId2
0 2021-04-05 03:00:00 4 99
1 2021-04-05 02:30:00 4 75
2 2021-04-05 04:00:00 2 75
P.S. If I'm not mistaken, there seems to be a problem in your expected output for GroupName1 = First, GroupName2 = Second: MesC
and MesD
are sums, and should be 9 and 4, respectively (instead of 4 and 2).
Upvotes: 1