Reputation: 431
I have a pandas dataframe with multiple rows sharing duplicate IDs, like this:
id col1 col2 col3
ENE80R N N Y
ENE80R N N N
ENE80R Y N N
What I'm trying to achieve is this:
id col1 col2 col3
ENE80R Y N Y
I tried using df.drop_duplicates, but it doesn't merge all the Ys into one row; it just keeps the first row and removes the duplicates. Any idea how I can go about this?
EDIT: The data doesn't only contain Y and N; there are other columns holding text and integer data as well. But the only values that differ between the duplicated rows are the Ys and Ns.
Upvotes: 0
Views: 352
Reputation: 30605
Perhaps a simple max with groupby, since 'Y' is greater than 'N' in string comparison, i.e.
df.groupby('id').max()
col1 col2 col3
id
ENE80R Y N Y
Or, thanks to @pirSquared:
df.set_index('id').max(level=0)
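For reference, a minimal, runnable sketch of this approach, using a frame rebuilt from the sample data in the question (the extra text/integer columns mentioned in the edit are left out):

import pandas as pd

df = pd.DataFrame({'id':   ['ENE80R', 'ENE80R', 'ENE80R'],
                   'col1': ['N', 'N', 'Y'],
                   'col2': ['N', 'N', 'N'],
                   'col3': ['Y', 'N', 'N']})

# String comparison is lexicographic and 'Y' > 'N', so the per-id max
# keeps a 'Y' whenever any of the duplicated rows contains one.
print(df.groupby('id').max())
#        col1 col2 col3
# id
# ENE80R    Y    N    Y

The set_index('id').max(level=0) form gives the same result; note that in more recent pandas versions the level= argument to reductions has been deprecated in favour of an explicit groupby.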
Upvotes: 1
Reputation: 294218
You want to replace 'Y' with True and 'N' with False, then do a groupby with any. Finally, clean it up by reversing the replace.
df.replace(dict(Y=True, N=False)) \
.groupby('id').any() \
.replace({True: 'Y', False: 'N'})
col1 col2 col3
id
ENE80R Y N Y
Alternatively, you could use pd.DataFrame.any with the level parameter.
df.set_index('id').eq('Y').any(level=0).replace({True: 'Y', False: 'N'})
col1 col2 col3
id
ENE80R Y N Y
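A self-contained sketch of the boolean round trip against the sample data from the question (an illustration only, not the asker's actual frame):

import pandas as pd

df = pd.DataFrame({'id':   ['ENE80R', 'ENE80R', 'ENE80R'],
                   'col1': ['N', 'N', 'Y'],
                   'col2': ['N', 'N', 'N'],
                   'col3': ['Y', 'N', 'N']})

# Map 'Y'/'N' to booleans, OR the flags together per id with any(),
# then map the booleans back to 'Y'/'N'.
out = df.replace(dict(Y=True, N=False)) \
        .groupby('id').any() \
        .replace({True: 'Y', False: 'N'})
print(out)
#        col1 col2 col3
# id
# ENE80R    Y    N    Y

Note that with the extra text/integer columns from the edit, you would want to restrict this to the Y/N columns, since any() is a boolean reduction; that caveat is an inference from the edit rather than part of the answer above.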
Upvotes: 3