VRumay

Reputation: 123

Remove/Drop duplicates EXCEPT where row contains certain string

I need to drop all duplicates using Pandas, except for the rows where the ID cell contains a certain string.

Given that the DataFrame is:

NAME    ID
Joe     110
Joe     123
Joe     PENDING
Mary    PENDING
Mary    110
Justin  123

I need to keep rows where 'ID' is PENDING, and at the same time drop the rest of the duplicates.

Desired output looks like this:

NAME    ID
Joe     110
Joe     123
Joe     PENDING
Mary    PENDING
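
For anyone who wants to test the answers below, a minimal sketch that rebuilds the example frame (column names taken from the question; everything else is just setup):

import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'NAME': ['Joe', 'Joe', 'Joe', 'Mary', 'Mary', 'Justin'],
                   'ID': [110, 123, 'PENDING', 'PENDING', 110, 123]})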

Upvotes: 4

Views: 1380

Answers (3)

elomat

Reputation: 139

Just pass both columns to duplicated in case you want to keep the NAME and ID combination unique:

import pandas as pd

data = [['Joe', 110],
        ['Joe', 123],
        ['Joe', 'PENDING'],
        ['Mary', 'PENDING'],
        ['Mary', 'PENDING'],
        ['es', 110],
        ['Joe', 110],
        ['Joe', 123]]

df = pd.DataFrame(data=data, columns=['NAME', 'ID'])

# keep the first occurrence of each (NAME, ID) pair, plus every PENDING row
df = df[~df.duplicated(['ID', 'NAME']) | (df['ID'] == 'PENDING')]

print(df)
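
With this sample data, the mask drops the repeated ('Joe', 110) and ('Joe', 123) rows at the end, while both ('Mary', 'PENDING') rows are kept because of the PENDING condition.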

Upvotes: 1

jezrael

Reputation: 863741

Use Series.duplicated inverted by ~ and chain it by | for bitwise OR with a comparison of ID against 'PENDING'; use the resulting mask to filter by boolean indexing:

df = df[~df['ID'].duplicated() | df['ID'].eq('PENDING')]
print(df)
   NAME       ID
0   Joe      110
1   Joe      123
2   Joe  PENDING
3  Mary  PENDING
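
If it helps to see the two halves of the mask separately, here is a small sketch using the same df (the intermediate variable names are just for illustration):

keep_first = ~df['ID'].duplicated()   # True only for the first occurrence of each ID
is_pending = df['ID'].eq('PENDING')   # True wherever ID equals the string 'PENDING'

print(df[keep_first | is_pending])    # selects the same rows as the one-liner above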

Upvotes: 3

Dani Mesejo

Reputation: 61930

You could use duplicated:

import pandas as pd

data = [['Joe', 110],
        ['Joe', 123],
        ['Joe', 'PENDING'],
        ['Mary', 'PENDING'],
        ['Mary', 110],
        ['Justin', 123]]

df = pd.DataFrame(data=data, columns=['NAME', 'ID'])

print(df[~df.duplicated('ID') | (df['ID'] == 'PENDING')])

As an alternative you could do:

print(df[df.ID.duplicated(keep='last') | df.ID.eq('PENDING')])

Output

   NAME       ID
0   Joe      110
1   Joe      123
2   Joe  PENDING
3  Mary  PENDING
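
A note on the alternative: duplicated(keep='last') flags every occurrence of an ID except the last one, so the mask selects the earlier copies directly instead of being inverted. In this sample every ID appears exactly twice, which is why the result matches the first approach; a quick sketch to inspect the mask (same df as above):

# True for all but the last occurrence of each ID
print(df.ID.duplicated(keep='last'))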

Upvotes: 6
