Arhiliuc Cristina
Arhiliuc Cristina

Reputation: 323

pandas dataframe group by max on one column

I'm trying to group by the 'keyword' column and get the characteristic with the larger number of records.

Let's consider the pandas df:

pd.DataFrame([['a', 'A'], ['b', 'A'], ['a', 'B'], ['b', 'B'], ['a', 'A'], ['c', 'B']], columns=['Keywords', 'Char'])

For the keyword a the characteristic A is the most frequent, for the keyword b either A or B are ok, for the keyword c, B is the most frequent.

In my case I have 10000 keywords and 3 characteristics. I want to have as a return a pd.Series with the keyword as an index and the most frequent characteristic as value or a dictionary with keyword as key and the most frequent characteristic as value.

I tried grouping my keywords and characteristics and count the rows as following:

res = frame.groupby(['Keywords', 'Char']).size().reset_index().rename(columns={0:'records'})

But I don't know how to get the characteristic corresponding to the maximum.

Expected output (any of this is ok):

pd.Series(data=['A', 'A', 'B'], index = ['a', 'b', 'c'])

or

pd.Series(data=['A', 'B', 'B'], index = ['a', 'b', 'c'])

or

{'a':'A', 'b':'A', 'c':'B'}

or

{'a':'A', 'b':'B', 'c':'B'}

Upvotes: 1

Views: 126

Answers (1)

jezrael
jezrael

Reputation: 863791

Use Series.value_counts in lambda function per groups by Keywords and return first value of index. Idea of solution is use value_counts because sorting values by default by counts:

res = frame.groupby('Keywords')['Char'].apply(lambda x: x.value_counts().index[0])
print (res)
Keywords
a    A
b    B
c    B
Name: Char, dtype: object

Upvotes: 3

Related Questions