Reputation: 12539
I have a data frame which has rows for each user joining my site and making a purchase.
+---+-----+--------------------+---------+--------+-----+
| | uid | msg | _time | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 | 1 | confirmed_settings | 1/29/15 | M | 37 |
| 1 | 1 | sale | 4/13/15 | M | 37 |
| 2 | 3 | confirmed_settings | 4/19/15 | M | 35 |
| 3 | 4 | confirmed_settings | 2/21/15 | M | 21 |
| 4 | 5 | confirmed_settings | 3/28/15 | M | 18 |
| 5 | 4 | sale | 3/15/15 | M | 21 |
+---+-----+--------------------+---------+--------+-----+
I would like to change the dataframe so that each row is unique for a uid and there is a columns called sale
and confirmed_settings
which have the timestamp of the action. Note that not every user has a sale
, but every user has a confirmed_settings
. Like below:
+---+-----+--------------------+---------+---------+--------+-----+
| | uid | confirmed_settings | sale | _time | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 | 1 | 1/29/15 | 4/13/15 | 1/29/15 | M | 37 |
| 1 | 3 | 4/19/15 | null | 4/19/15 | M | 35 |
| 2 | 4 | 2/21/15 | 3/15/15 | 2/21/15 | M | 21 |
| 3 | 5 | 3/28/15 | null | 3/28/15 | M | 18 |
+---+-----+--------------------+---------+---------+--------+-----+
To do this, I am trying:
df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')
But I get this error: ValueError: Index contains duplicate entries, cannot reshape
How can I pivot a df with duplicate index values to transform my dataframe?
Edit:
df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()
gives this error DataError: No numeric types to aggregate
but im not even sure that is the right path to go on.
Upvotes: 8
Views: 31591
Reputation: 2507
I suspect there are indeed duplicate uid
-msg
entries/keys (e.g. uid
2 has 2 confirmed_settings entries under msg
), which you alluded to in the comments for fixxxer's answer. If there are, you can't use pivot
, because you can't tell it how to treat the different values encountered during aggregation (count? max? mean? sum?). Note that the Index error is an error on the Index of the resulting pivoted table df1
, not the original DataFrame df
.
pivot_table
lets you do it however with the aggfunc
argument. How about something like this?
df1 = df.pivot_table(index = 'uid', columns = 'msg', values = '_time', aggfunc = len)
This will help you figure out which user-msg records have duplicate entries (anything with over 1), and after cleaning them out, you can use pivot
on df
to successfully pivot _time
.
Upvotes: 8
Reputation: 16174
x
is the data frame that you have as input :
uid msg _time gender age
0 1 confirmed_settings 1/29/15 M 37
1 1 sale 4/13/15 M 37
2 3 confirmed_settings 4/19/15 M 35
3 4 confirmed_settings 2/21/15 M 21
4 5 confirmed_settings 3/28/15 M 18
5 4 sale 3/15/15 M 21
y = x.pivot(index='uid', columns='msg', values='_time')
x.join(y).drop('msg', axis=1)
gives you:
uid _time gender age confirmed_settings sale
0 1 1/29/15 M 37 NaN NaN
1 1 4/13/15 M 37 1/29/15 4/13/15
2 3 4/19/15 M 35 NaN NaN
3 4 2/21/15 M 21 4/19/15 NaN
4 5 3/28/15 M 18 2/21/15 3/15/15
5 4 3/15/15 M 21 3/28/15 NaN
Upvotes: 4
Reputation: 109746
You can use groupby to aggregate by the common factors, take the max of time to get the most recent dates, and then unstack the msg to view confirmed_settings and sale side by side:
df.groupby(['uid', 'msg', 'gender', 'age']).time.max().unstack('msg')
msg confirmed_settings sale
uid gender age
1 M 37 1/29/15 4/13/15
3 M 35 4/19/15 NaN
4 M 21 2/21/15 3/15/15
5 M 18 3/28/15 NaN
Upvotes: 3