william007

Reputation: 18545

Pandas to have one row per email

Say I have the following table, in which Peter and Andy share the same email:

Name   Age  occupation  BillingContactEmail
Peter  44   Salesman    [email protected]
Andy   43   Manager     [email protected]
Halla  33   Fisherman   [email protected]

How can I make the DataFrame contain only

Name   Age  occupation  BillingContactEmail
Peter  44   Salesman    [email protected]
Halla  33   Fisherman   [email protected]

so that each email appears in exactly one row (meaning the emails are distinct in the end)?

Upvotes: 0

Views: 58

Answers (1)

piRSquared

Reputation: 294488

Use drop_duplicates:

df.drop_duplicates(subset=['BillingContactEmail'])

    Name  Age occupation BillingContactEmail
0  Peter   44   Salesman             [email protected]
2  Halla   33  Fisherman             [email protected]
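For anyone who wants to reproduce this, here is a minimal self-contained sketch. The email addresses are redacted in the post, so the ones below are made-up placeholders, chosen only so that Peter and Andy share an address as in the output above.

```python
import pandas as pd

# Placeholder addresses (assumption): the real emails are redacted in the post.
# Peter and Andy deliberately share one address; Halla has her own.
df = pd.DataFrame({
    "Name": ["Peter", "Andy", "Halla"],
    "Age": [44, 43, 33],
    "occupation": ["Salesman", "Manager", "Fisherman"],
    "BillingContactEmail": [
        "shared@example.com",
        "shared@example.com",
        "halla@example.com",
    ],
})

# Keep one row per email; by default the first occurrence is kept.
deduped = df.drop_duplicates(subset=["BillingContactEmail"])

# Sanity check: every email now appears only once.
emails_are_unique = deduped["BillingContactEmail"].is_unique

print(deduped)
```

Note that drop_duplicates returns a new DataFrame and keeps the original index labels (0 and 2 here); chain .reset_index(drop=True) if you want a fresh 0..n-1 index.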

Addressing @DSM's comment

You should be more specific about what criterion you want to use to decide which one to keep. The first seen with that email? The oldest? Etc.

By default, drop_duplicates keeps the first instance found. This is equivalent to

df.drop_duplicates(subset=['BillingContactEmail'], keep='first')

However, you can also keep the last instance instead by passing keep='last':

df.drop_duplicates(subset=['BillingContactEmail'], keep='last')

    Name  Age occupation BillingContactEmail
1   Andy   43    Manager             [email protected]
2  Halla   33  Fisherman             [email protected]

Or drop every row whose email is duplicated by passing keep=False:

df.drop_duplicates(subset=['BillingContactEmail'], keep=False)

    Name  Age occupation BillingContactEmail
2  Halla   33  Fisherman             [email protected]
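If the criterion is something other than row order, such as "the oldest" mentioned in the comment, one possible approach is to sort first and then drop duplicates. This is only a sketch: it assumes Age is the column that decides which row to keep, and the email addresses are made-up placeholders because the real ones are redacted.

```python
import pandas as pd

# Placeholder addresses (assumption): the real emails are redacted in the post.
df = pd.DataFrame({
    "Name": ["Peter", "Andy", "Halla"],
    "Age": [44, 43, 33],
    "occupation": ["Salesman", "Manager", "Fisherman"],
    "BillingContactEmail": [
        "shared@example.com",
        "shared@example.com",
        "halla@example.com",
    ],
})

# Oldest person per email: sort by Age descending so the oldest row comes
# first within each email, keep that first row, then restore original order.
oldest = (
    df.sort_values("Age", ascending=False)
      .drop_duplicates(subset=["BillingContactEmail"], keep="first")
      .sort_index()
)

# Youngest person per email: same idea with an ascending sort.
youngest = (
    df.sort_values("Age", ascending=True)
      .drop_duplicates(subset=["BillingContactEmail"], keep="first")
      .sort_index()
)

print(oldest)
print(youngest)
```

With this data, oldest keeps Peter and Halla, while youngest keeps Andy and Halla.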

Upvotes: 4
