Find maximum in Dataframe based on variable values

Question

I have a dataframe of the form:

A| B| C | D
a| x| r | 1
a| x| s | 2
a| y| r | 1
b| w| t | 4
b| z| v | 2

I'd like to be able to return something like (showing unique values and frequency)

A| freq of most common value in Column B |maximum of column D based on the most common value in Column B | most common value in Column B
a       2                                                  2                                                           x
b       1                                                  4                                                           w

at the moment i can calculate the everything but the 3 column of the result dataframe quiet fast via

df = (df.groupby('A', sort=False)['B']
    .apply(lambda x: x.value_counts().head(1))
    .reset_index()

but to calculate the 2 Column ("maximum of column D based on the most common value in Column B") i have writen a for-loop witch is slow for a lot of data. Is there a fast way?

The question is linked to: Count values in dataframe based on entry

jezrael · Accepted Answer

Use merge with get rows by maximum D per groups by DataFrameGroupBy.idxmax:

df1 = (df.groupby('A', sort=False)['B']
        .apply(lambda x: x.value_counts().head(1))
        .reset_index()
        .rename(columns={'level_1':'E'}))
#print (df1)

df = df1.merge(df, left_on=['A','E'], right_on=['A','B'], suffixes=('','_'))
df = df.loc[df.groupby('A')['D'].idxmax(), ['A','B','D','E']]
print (df)
   A  B  D  E
1  a  2  2  x
2  b  1  4  w

Find maximum in Dataframe based on variable values

Answers (2)

Related Questions