Hector Montero

Reputation: 13

Flag duplicate rows in a pandas DataFrame by creating a new column when there's a match

I'm currently running an analysis to find people who might be doing something to game the system. I created a new field, Audit ID, in my DataFrame by joining the name of the potential offender and the Issue Date. What I want is a new column that says Yes for every row whose Audit ID also appears in another row, and No (or NaN) otherwise.

So for example, I have:

Offender Name  Issue Date  Audit ID
Joe            12/02/2020  Joe-12/02/20
Nic            20/02/2020  Nic-20/02/20
Mat            01/02/2020  Mat-01/02/20
Joe            12/02/2020  Joe-12/02/20

And I want something like:

Offender Name  Issue Date  Audit ID        Matches
Joe            12/02/2020  Joe-12/02/20    Yes
Nic            20/02/2020  Nic-20/02/20    No
Mat            01/02/2020  Mat-01/02/20    No
Joe            12/02/2020  Joe-12/02/20    Yes

I'd appreciate any insights anyone can give me.

Upvotes: 0

Views: 51

Answers (1)

Michael Szczesny

Reputation: 5036

You can mark duplicates with 'Yes' and 'No'. Passing keep=False flags every row of a duplicated group, not just the later repeats:

df['Matches'] = df.duplicated('Audit ID', keep=False).map({True: 'Yes', False: 'No'})
df

Out:

  Offender Name  Issue Date      Audit ID Matches
0           Joe  12/02/2020  Joe-12/02/20     Yes
1           Nic  20/02/2020  Nic-20/02/20      No
2           Mat  01/02/2020  Mat-01/02/20      No
3           Joe  12/02/2020  Joe-12/02/20     Yes
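The question also mentions wanting NaN rather than 'No' for rows without a match. One way to get that, as a minimal sketch, is to leave False out of the mapping dict, since Series.map fills unmapped values with NaN. The sample frame below is rebuilt from the question's table, and the name dupes is only an illustrative helper:

```python
import pandas as pd

# Sample data reconstructed from the table in the question
df = pd.DataFrame({
    'Offender Name': ['Joe', 'Nic', 'Mat', 'Joe'],
    'Issue Date': ['12/02/2020', '20/02/2020', '01/02/2020', '12/02/2020'],
    'Audit ID': ['Joe-12/02/20', 'Nic-20/02/20', 'Mat-01/02/20', 'Joe-12/02/20'],
})

# keep=False marks every member of a duplicated group, including the first
dupes = df.duplicated('Audit ID', keep=False)

# No entry for False in the dict, so non-matching rows become NaN
df['Matches'] = dupes.map({True: 'Yes'})
print(df)
```

With this variant, df['Matches'].isna() picks out the rows that have no duplicate.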

The Audit ID column is redundant. You already have the same information in your DataFrame, so you can check for duplicates on the original columns directly:

df['Matches'] = df.duplicated(['Offender Name', 'Issue Date'], keep=False).map({True: 'Yes', False: 'No'})
df

Out:

  Offender Name  Issue Date      Audit ID Matches
0           Joe  12/02/2020  Joe-12/02/20     Yes
1           Nic  20/02/2020  Nic-20/02/20      No
2           Mat  01/02/2020  Mat-01/02/20      No
3           Joe  12/02/2020  Joe-12/02/20     Yes
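If you want to confirm that the two approaches agree on your own data before dropping Audit ID, a quick check along these lines may help. This is only a sketch on the sample rows from the question; the names by_id, by_cols, and same are illustrative. On real data the results could differ if Audit ID loses information, for example through the two-digit year:

```python
import pandas as pd

# Sample data reconstructed from the table in the question
df = pd.DataFrame({
    'Offender Name': ['Joe', 'Nic', 'Mat', 'Joe'],
    'Issue Date': ['12/02/2020', '20/02/2020', '01/02/2020', '12/02/2020'],
    'Audit ID': ['Joe-12/02/20', 'Nic-20/02/20', 'Mat-01/02/20', 'Joe-12/02/20'],
})

# Duplicate flags computed from the combined key
by_id = df.duplicated('Audit ID', keep=False)

# Duplicate flags computed from the two source columns
by_cols = df.duplicated(['Offender Name', 'Issue Date'], keep=False)

# True when both boolean Series flag exactly the same rows
same = by_id.equals(by_cols)
print(same)
```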

Upvotes: 1
