Reputation: 18545
Say I have the following table, in which Peter and Andy share the same email:
Name Age occupation BillingContactEmail
Peter 44 Salesman [email protected]
Andy 43 Manager [email protected]
Halla 33 Fisherman [email protected]
How can I make pandas reduce it to
Name Age occupation BillingContactEmail
Peter 44 Salesman [email protected]
Halla 33 Fisherman [email protected]
where only one row is kept for each email? (That is, the emails should be distinct in the end.)
Upvotes: 0
Views: 58
Reputation: 294488
Use drop_duplicates:
df.drop_duplicates(subset=['BillingContactEmail'])
Name Age occupation BillingContactEmail
0 Peter 44 Salesman [email protected]
2 Halla 33 Fisherman [email protected]
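For anyone who wants to reproduce this, here is a minimal sketch that builds the table from the question. The addresses are obscured in the post, so the ones below are made-up placeholders (an assumption); all that matters is that Peter and Andy share one.

```python
import pandas as pd

# Placeholder emails (hypothetical): the real addresses are hidden in the post.
# Peter and Andy deliberately share the same one.
df = pd.DataFrame({
    "Name": ["Peter", "Andy", "Halla"],
    "Age": [44, 43, 33],
    "occupation": ["Salesman", "Manager", "Fisherman"],
    "BillingContactEmail": [
        "shared@example.com",
        "shared@example.com",
        "halla@example.com",
    ],
})

# Keep the first row seen for each email; original index labels are preserved.
deduped = df.drop_duplicates(subset=["BillingContactEmail"])
print(deduped)
```

Note that drop_duplicates returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it.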
Addressing @DSM's comment:
You should be more specific about what criterion you want to use to decide which one to keep. The first seen with that email? The oldest? Etc.
By default, drop_duplicates keeps the first instance found. This is equivalent to
df.drop_duplicates(subset=['BillingContactEmail'], keep='first')
However, you can also keep the last instance instead by passing keep='last':
df.drop_duplicates(subset=['BillingContactEmail'], keep='last')
Name Age occupation BillingContactEmail
1 Andy 43 Manager [email protected]
2 Halla 33 Fisherman [email protected]
Or drop every row whose email appears more than once, by passing keep=False:
df.drop_duplicates(subset=['BillingContactEmail'], keep=False)
Name Age occupation BillingContactEmail
2 Halla 33 Fisherman [email protected]
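If the criterion is something other than row order, such as "the oldest person per email" from the comment, one possible approach is to sort first and then deduplicate. This is only a sketch: it rebuilds the question's table with placeholder emails (an assumption, since the real ones are hidden) and assumes Age is the column that decides which row wins.

```python
import pandas as pd

# Placeholder emails (hypothetical); Peter and Andy share one, as in the question.
df = pd.DataFrame({
    "Name": ["Peter", "Andy", "Halla"],
    "Age": [44, 43, 33],
    "occupation": ["Salesman", "Manager", "Fisherman"],
    "BillingContactEmail": [
        "shared@example.com",
        "shared@example.com",
        "halla@example.com",
    ],
})

oldest = (
    df.sort_values("Age", ascending=False)   # oldest rows come first
      .drop_duplicates(subset=["BillingContactEmail"], keep="first")
      .sort_index()                          # restore the original row order
)
print(oldest)
```

With this data Peter (44) is older than Andy (43), so the result happens to match keep='first'. On other data the two can differ, which is why it helps to state the criterion explicitly.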
Upvotes: 4