Reputation: 2324
I have this simple dataframe:
ID Name State
1 John DC
1 John VA
2 Smith NE
3 Janet CA
3 Janet NC
3 Janet MD
I want to delete the State
value for the duplicate IDs
like so:
ID Name State
1 John nan
1 John nan
2 Smith NE
3 Janet nan
3 Janet nan
3 Janet nan
Any idea how to solve this problem?
Thanks,
Upvotes: 2
Views: 60
Reputation: 7828
you can use np.where:
In[25]:df['State']=np.where(df['Name'].duplicated(keep=False),np.nan,df['State'])
In[26]:df
Out[26]:
ID Name State
0 1 John NaN
1 1 John NaN
2 2 Smith NE
3 3 Janet NaN
4 3 Janet NaN
5 3 Janet NaN
Timings:
%timeit df.loc[df.duplicated(subset=['ID'], keep=False), 'State'] = None
100 loops, best of 3: 2.32 ms per loop
%timeit df['State']=np.where(df['Name'].duplicated(keep=False),np.nan,df['State'])
1000 loops, best of 3: 657 µs per loop
Upvotes: 2
Reputation: 294228
duplicated
returns a boolean mask where rows are duplicated over the columns defined in subset
. keep=False
indicates that we shouldn't consider the first or last of the duplicates as non-duplicate. Using loc
then allows us to assign to the rows where duplicates happen.
df.loc[df.duplicated(subset=['ID'], keep=False), 'State'] = None
df
Upvotes: 2