Reputation: 123
I need to drop all duplicates using Pandas, except for the ones where the cell contains a certain string.
Given that DF is:
NAME ID
Joe 110
Joe 123
Joe PENDING
Mary PENDING
Mary 110
Justin 123
I need to keep rows where 'ID' is PENDING, and at the same time drop the rest of the duplicates.
Desired output looks like this:
NAME ID
Joe 110
Joe 123
Joe PENDING
Mary PENDING
Upvotes: 4
Views: 1380
Reputation: 139
Just add another column in case you wanted to keep both Name
and ID
unique:
import pandas as pd
data = [['Joe', 110],
['Joe', 123],
['Joe', 'PENDING'],
['Mary', 'PENDING'],
['Mary', 'PENDING'],
['es', 110],
['Joe', 110],
['Joe', 123]]
df = pd.DataFrame(data=data, columns=['NAME', 'ID'])
df = df[~df.duplicated(['ID', 'NAME']) | (df['ID'] == 'PENDING')]
print(df)
Upvotes: 1
Reputation: 863741
Use Series.duplicated
with inverting by ~
and chain by |
for bitwise OR
with compare ID
for PENDING
for mask for filter by boolean indexing
:
df = df[~df['ID'].duplicated() | df['ID'].eq('PENDING')]
print (df)
NAME ID
0 Joe 110
1 Joe 123
2 Joe PENDING
3 Mary PENDING
Upvotes: 3
Reputation: 61930
You could use duplicated:
import pandas as pd
data = [['Joe', 110],
['Joe', 123],
['Joe', 'PENDING'],
['Mary', 'PENDING'],
['Mary', 110],
['Justin', 123]]
df = pd.DataFrame(data=data, columns=['NAME', 'ID'])
print(df[~df.duplicated('ID') | (df['ID'] == 'PENDING')])
As an alternative you could do:
print(df[df.ID.duplicated(keep='last') | df.ID.eq('PENDING')])
Output
NAME ID
0 Joe 110
1 Joe 123
2 Joe PENDING
3 Mary PENDING
Upvotes: 6