Reputation: 109

Python: Remove duplicates from DataFrame based on another column value

I have the following table with fictive data:

I want to remove any duplicate rows and keep only the row which contains a positive value in "Won Turnover". Hence, the two rows marked with red should be removed in this case

Moreover, if there are duplicate rows with only Lost Turnover, then the row with the highest turnover should be kept (The bottom two rows).

Upvotes: 2

Answers (2)

jezrael

Reputation: 863731

First test Won Turnover if only missing value per groups with GroupBy.all and test only max value per Lost Turnover. Chain by & for bitwise AND and add new condition for return all not missing rows per Won Turnover with | for bitwise OR:

m1 = (df.assign(new = df['Won Turnover'].isna())
        .groupby(['Date','Supplier','Customer'])['new'].transform('all'))
m2 = (df.groupby(['Date','Supplier','Customer'])['Lost Turnover'].transform('max')
        .eq(df['Lost Turnover']))
df = df[(m1 & m2) | df['Won Turnover'].notna()]
print (df)

         Date Supplier   Customer  Won Turnover  Lost Turnover
1  25.06.2019     Nike      Pepsi       25000.0            NaN
2  25.06.2019     Nike  McDonalds       10000.0            NaN
3  25.06.2019   Adidas  Coca Cola       12000.0            NaN
5  25.06.2019   Adidas  McDonalds       35000.0            NaN
6  25.06.2019   Adidas      Pepsi           NaN        15000.0

Upvotes: 0

zipa

Reputation: 27899

Maybe this can do it:

df.sort_values(['Won Turnover', 'Lost Turnover'], ascending=False).drop_duplicates('Supplier')

Upvotes: 3

Python: Remove duplicates from DataFrame based on another column value

Answers (2)

Related Questions