Reputation: 169
I am a newbie to Python and I am needing a solution for this example below, This is how my df looks like:
Index classcode product_id Season Sales Score
1 65 102 00 210190062 2018_2 1000 3
2 65 102 00 210190062 2018_2 1000 5
89 66 107 00 210189987 2018_4 1500 10
I just need the one row based on min or max value of the 'Score'column from Index 1 or 2 and the row with index 89 as well. The only different value among Index 1 and 2 is the Score which is always unique while the rest of the colunmns are identical.The Score is not the same for the same product_id or classode or any other column(s) in the df. I just want to eliminate the double counting of the Sales. Is there a function or logic in pandas to achieve this? I tried creating a new data frame by returning all columns and grouping them by the max of Score and it did not work. I have done this in SQL using window functions but not sure about what to do here. The Index is the default index created from the data frame. The expected output for the example would be like below,
Index classcode product_id Season Sales Score
2 65 102 00 210190062 2018_2 1000 5
89 66 107 00 210189987 2018_4 1500 10
Upvotes: 0
Views: 221
Reputation: 42916
There several ways to do this:
groupby
& transform
cols = ['classcode', 'product_id', 'Season', 'Sales']
df[df['Score'].eq(df.groupby(cols)['Score'].transform('max'))]
classcode product_id Season Sales Score
Index
2 65 102 00 210190062 2018_2 1000 5
89 66 107 00 210189987 2018_4 1500 10
sort_values
& drop_duplicates
cols = ['classcode', 'product_id', 'Season', 'Sales','Score']
df.sort_values(cols).drop_duplicates(cols, keep='last')
classcode product_id Season Sales Score
Index
2 65 102 00 210190062 2018_2 1000 5
89 66 107 00 210189987 2018_4 1500 10
Upvotes: 1
Reputation: 56
I think this should work.
I am just assuming your dataframe is foo
foo.groupby(['classcode','product_id','Season','Sales'])['Score'].max()
Upvotes: 2