mober
mober

Reputation: 109

Python: Remove duplicates from DataFrame based on another column value

I have the following table with fictive data:

enter image description here

I want to remove any duplicate rows and keep only the row which contains a positive value in "Won Turnover". Hence, the two rows marked with red should be removed in this case

Moreover, if there are duplicate rows with only Lost Turnover, then the row with the highest turnover should be kept (The bottom two rows).

Upvotes: 2

Views: 853

Answers (2)

jezrael
jezrael

Reputation: 862581

First test Won Turnover if only missing value per groups with GroupBy.all and test only max value per Lost Turnover. Chain by & for bitwise AND and add new condition for return all not missing rows per Won Turnover with | for bitwise OR:

m1 = (df.assign(new = df['Won Turnover'].isna())
        .groupby(['Date','Supplier','Customer'])['new'].transform('all'))
m2 = (df.groupby(['Date','Supplier','Customer'])['Lost Turnover'].transform('max')
        .eq(df['Lost Turnover']))
df = df[(m1 & m2) | df['Won Turnover'].notna()]
print (df)

         Date Supplier   Customer  Won Turnover  Lost Turnover
1  25.06.2019     Nike      Pepsi       25000.0            NaN
2  25.06.2019     Nike  McDonalds       10000.0            NaN
3  25.06.2019   Adidas  Coca Cola       12000.0            NaN
5  25.06.2019   Adidas  McDonalds       35000.0            NaN
6  25.06.2019   Adidas      Pepsi           NaN        15000.0

Upvotes: 0

zipa
zipa

Reputation: 27869

Maybe this can do it:

df.sort_values(['Won Turnover', 'Lost Turnover'], ascending=False).drop_duplicates('Supplier')

Upvotes: 3

Related Questions