Reputation: 2616
Suppose, I have a dataframe in pandas like below:
campaignname category_type amount
A cat_A_0 2.0
A cat_A_1 1.0
A cat_A_2 3.0
A cat_A_2 3.0
A cat_A_2 4.0
B cat_B_0 3.0
C cat_C_0 1.0
C cat_C_1 2.0
I am using the following code to group the above dataframe (say it's assigned variable name df
) by different columns as follows:
for name, gp in df.groupby('campaignname'):
sorted_gp = gp.groupby(['campaignname', 'category_type']).sum().sort_values(['amount'], ascending=False)
# I'd like to know how to select this in a cleaner/more concise way
first_row = [sorted_gp.iloc[0].name[0], sorted_gp.iloc[0].name[1], sorted_gp.iloc[0].values.tolist()[0]]
The purpose of the above code is to first groupby
the raw data on campaignname
column, then in each of the resulting group, I'd like to group again by both campaignname
and category_type
, and finally, sort by amount
column to choose the first row that comes up (the one with the highest amount
in each group. Specifically for the above example, I'd like to get results like this:
first_row = ['A', 'cat_A_2', 4.0] # for the first group
first_row = ['B', 'cat_B_0', 3.0] # for the second group
first_row = ['C', 'cat_C_1', 2.0] # for the third group
etc.
As you can see, I'm using a rather (in my opinion) 'ugly' way to retrieve the first row of each sorted group, but since I'm new to pandas, I don't know a better/cleaner way to accomplish this. If anyone could let me know a way to select the first row in a sorted group from a pandas dataframe, I'd greatly appreciate it. Thank you in advance for your answers/suggestions!
Upvotes: 9
Views: 23368
Reputation: 294258
My preferred way to do this is with idxmax
. It returns the index of the maximum value. I subsequently use that index to slice df
df.loc[df.groupby('campaignname').amount.idxmax()]
campaignname category_type amount
4 A cat_A_2 4.0
5 B cat_B_0 3.0
7 C cat_C_1 2.0
Upvotes: 4
Reputation: 210842
IIUC you can do it this way:
In [83]: df.groupby('campaignname', as_index=False) \
.apply(lambda x: x.nlargest(1, columns=['amount'])) \
.reset_index(level=1, drop=1)
Out[83]:
campaignname category_type amount
0 A cat_A_2 4.0
1 B cat_B_0 3.0
2 C cat_C_1 2.0
or:
In [76]: df.sort_values('amount', ascending=False).groupby('campaignname').head(1)
Out[76]:
campaignname category_type amount
4 A cat_A_2 4.0
5 B cat_B_0 3.0
7 C cat_C_1 2.0
Upvotes: 10