MEhsan
MEhsan

Reputation: 2324

How to Remove the Data in a Specific Column for the Duplicate IDs?

I have this simple dataframe:

ID  Name    State
1   John    DC
1   John    VA
2   Smith   NE
3   Janet   CA
3   Janet   NC
3   Janet   MD      

I want to delete the State value for the duplicate IDs like so:

ID  Name    State
1   John    nan
1   John    nan
2   Smith   NE
3   Janet   nan
3   Janet   nan
3   Janet   nan

Any idea how to solve this problem?

Thanks,

Upvotes: 2

Views: 60

Answers (2)

shivsn
shivsn

Reputation: 7828

you can use np.where:

In[25]:df['State']=np.where(df['Name'].duplicated(keep=False),np.nan,df['State'])

In[26]:df
Out[26]: 
   ID   Name State
0   1   John   NaN
1   1   John   NaN
2   2  Smith    NE
3   3  Janet   NaN
4   3  Janet   NaN
5   3  Janet   NaN

Timings:

%timeit df.loc[df.duplicated(subset=['ID'], keep=False), 'State'] = None
100 loops, best of 3: 2.32 ms per loop

%timeit df['State']=np.where(df['Name'].duplicated(keep=False),np.nan,df['State'])
1000 loops, best of 3: 657 µs per loop

Upvotes: 2

piRSquared
piRSquared

Reputation: 294228

duplicated returns a boolean mask where rows are duplicated over the columns defined in subset. keep=False indicates that we shouldn't consider the first or last of the duplicates as non-duplicate. Using loc then allows us to assign to the rows where duplicates happen.

df.loc[df.duplicated(subset=['ID'], keep=False), 'State'] = None

df

enter image description here

Upvotes: 2

Related Questions