Jude92
Jude92

Reputation: 169

Logic in Python to return only one row among similar row(s) based on the maximum value in a column in a dataframe

I am a newbie to Python and I am needing a solution for this example below, This is how my df looks like:

Index   classcode   product_id  Season Sales Score
1      65 102 00    210190062   2018_2  1000   3
2      65 102 00    210190062   2018_2  1000   5
89     66 107 00    210189987   2018_4  1500   10

I just need the one row based on min or max value of the 'Score'column from Index 1 or 2 and the row with index 89 as well. The only different value among Index 1 and 2 is the Score which is always unique while the rest of the colunmns are identical.The Score is not the same for the same product_id or classode or any other column(s) in the df. I just want to eliminate the double counting of the Sales. Is there a function or logic in pandas to achieve this? I tried creating a new data frame by returning all columns and grouping them by the max of Score and it did not work. I have done this in SQL using window functions but not sure about what to do here. The Index is the default index created from the data frame. The expected output for the example would be like below,

Index   classcode   product_id  Season Sales Score
2      65 102 00    210190062   2018_2  1000   5
89     66 107 00    210189987   2018_4  1500   10

Upvotes: 0

Views: 221

Answers (2)

Erfan
Erfan

Reputation: 42916

There several ways to do this:

groupby & transform

cols = ['classcode', 'product_id', 'Season', 'Sales']

df[df['Score'].eq(df.groupby(cols)['Score'].transform('max'))]

       classcode  product_id  Season  Sales  Score
Index                                             
2      65 102 00   210190062  2018_2   1000      5
89     66 107 00   210189987  2018_4   1500     10

sort_values & drop_duplicates

cols = ['classcode', 'product_id', 'Season', 'Sales','Score']

df.sort_values(cols).drop_duplicates(cols, keep='last')

       classcode  product_id  Season  Sales  Score
Index                                             
2      65 102 00   210190062  2018_2   1000      5
89     66 107 00   210189987  2018_4   1500     10

Upvotes: 1

Nithin Reddy
Nithin Reddy

Reputation: 56

I think this should work.

I am just assuming your dataframe is foo

foo.groupby(['classcode','product_id','Season','Sales'])['Score'].max()

Upvotes: 2

Related Questions