Reputation: 21
I have a table with customer IDs and emails. Some users have multiple emails. The table looks something like this:
| Customer | Email |
| ----------| -------------- |
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
| 1 | [email protected]|
What I would like to do is rearrange the table such that each customer ID has only one row and secondary emails are added as additional columns. Something like this:
| Customer | Email1 |Email2 |
| ----------| -------------- |---------------|
| 1 | [email protected] |[email protected]
| 2 | [email protected] | |
| 3 | [email protected] | |
What's the best way to do this using pandas? I have tried using df.pivot but that doesn't seem to be working for me.
Upvotes: 0
Views: 89
Reputation: 1319
You can use Series.duplicated()
+ pd.merge()
+ DataFrame.drop_duplicates()
# We get the Customers with more than one email.
df_seconds_email = df[df['Customer'].duplicated()]
# We merge your original dataframe (I called it 'df') and the above one, suffixes param help us to get
# 'Email2' column, finally we drop duplicates taking into account 'Customer' column.
df = pd.merge(df, df_seconds_email, how='left', on=['Customer'], suffixes=('', '2')).drop_duplicates(subset='Customer')
print(df)
Output:
Customer Email Email2
0 1 [email protected] [email protected]
1 2 [email protected] NaN
2 3 [email protected] NaN
Upvotes: 1
Reputation: 2947
You can use cumcount
to create MultiIndex. Then reshape the data by using unstack
and add change columns names by add_prefix
:
df = (df.set_index(['Customer',df.groupby('Customer').cumcount()])['Email']
.unstack()
.add_prefix('Email')
.reset_index())
print(df)
And you'll get exactly what you want.
Upvotes: 0