Reputation: 117
First question on SO, very new to pandas and still a little shaky on the terminology: I'm trying to figure out the proper syntax/sequence of operations on a dataframe to be able to group by column B, find the max (or min) corresponding value for each group in column C, and retrieve the corresponding value for that in column A.
Suppose this is my dataframe:
name type votes
bob dog 10
pete cat 8
fluffy dog 5
max cat 9
Using df.groupby('type').votes.agg('max')
returns:
dog 10
cat 9
So far, so good. However, I'd like to figure out how to return this:
dog 10 bob
cat 9 max
I've gotten as far as df.groupby(['type', 'votes']).name.agg('max')
, though that returns
dog 5 fluffy
10 bob
cat 8 pete
9 max
... which is fine for this pretend dataframe, but doesn't quite help when working with a much larger one.
Thanks very much!
Upvotes: 9
Views: 6522
Reputation: 149
df= df.groupby('type').agg({'votes': ['count','mean','median','min','max']})
print(ds_method_range)
Upvotes: 0
Reputation: 879729
If df
has an index with no duplicate values, then you can use idxmax
to return the index of the maximum row for each group. Then use df.loc
to select the entire row:
In [322]: df.loc[df.groupby('type').votes.agg('idxmax')]
Out[322]:
name type votes
3 max cat 9
0 bob dog 10
If df.index
has duplicate values, i.e. is not a unique index, then make the index unique first:
df = df.reset_index()
then use idxmax
:
result = df.loc[df.groupby('type').votes.agg('idxmax')]
If you really need to, you can return df
to its original state:
df = df.set_index(['index'], drop=True)
but in general life is much better with a unique index.
Here is an example showing what goes wrong when df
does not have a unique
index. Suppose the index
is AABB
:
import pandas as pd
df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'],
'type': ['dog', 'cat', 'dog', 'cat'],
'votes': [10, 8, 5, 9]},
index=list('AABB'))
print(df)
# name type votes
# A bob dog 10
# A pete cat 8
# B fluffy dog 5
# B max cat 9
idxmax
returns the index values A
and B
:
print(df.groupby('type').votes.agg('idxmax'))
type
cat B
dog A
Name: votes, dtype: object
But A
and B
do not uniquely specify the desired rows. df.loc[...]
returns all rows whose index value is A
or B
:
print(df.loc[df.groupby('type').votes.agg('idxmax')])
# name type votes
# B fluffy dog 5
# B max cat 9
# A bob dog 10
# A pete cat 8
In contrast, if we reset the index:
df = df.reset_index()
# index name type votes
# 0 A bob dog 10
# 1 A pete cat 8
# 2 B fluffy dog 5
# 3 B max cat 9
then df.loc
can be used to select the desired rows:
print(df.groupby('type').votes.agg('idxmax'))
# type
# cat 3
# dog 0
# Name: votes, dtype: int64
print(df.loc[df.groupby('type').votes.agg('idxmax')])
# index name type votes
# 3 B max cat 9
# 0 A bob dog 10
Upvotes: 10