metersk
metersk

Reputation: 12539

Pivoting a pandas dataframe with duplicate index values

I have a data frame which has rows for each user joining my site and making a purchase.

+---+-----+--------------------+---------+--------+-----+
|   | uid |        msg         |  _time  | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 |   1 | confirmed_settings | 1/29/15 | M      |  37 |
| 1 |   1 | sale               | 4/13/15 | M      |  37 |
| 2 |   3 | confirmed_settings | 4/19/15 | M      |  35 |
| 3 |   4 | confirmed_settings | 2/21/15 | M      |  21 |
| 4 |   5 | confirmed_settings | 3/28/15 | M      |  18 |
| 5 |   4 | sale               | 3/15/15 | M      |  21 |
+---+-----+--------------------+---------+--------+-----+

I would like to change the dataframe so that each row is unique for a uid and there is a columns called sale and confirmed_settings which have the timestamp of the action. Note that not every user has a sale, but every user has a confirmed_settings. Like below:

+---+-----+--------------------+---------+---------+--------+-----+
|   | uid | confirmed_settings |  sale   |  _time  | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 |   1 | 1/29/15            | 4/13/15 | 1/29/15 | M      |  37 |
| 1 |   3 | 4/19/15            | null    | 4/19/15 | M      |  35 |
| 2 |   4 | 2/21/15            | 3/15/15 | 2/21/15 | M      |  21 |
| 3 |   5 | 3/28/15            | null    | 3/28/15 | M      |  18 |
+---+-----+--------------------+---------+---------+--------+-----+

To do this, I am trying:

df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')

But I get this error: ValueError: Index contains duplicate entries, cannot reshape

How can I pivot a df with duplicate index values to transform my dataframe?


Edit: df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()

gives this error DataError: No numeric types to aggregate but im not even sure that is the right path to go on.

Upvotes: 8

Views: 31591

Answers (3)

selwyth
selwyth

Reputation: 2507

I suspect there are indeed duplicate uid-msg entries/keys (e.g. uid 2 has 2 confirmed_settings entries under msg), which you alluded to in the comments for fixxxer's answer. If there are, you can't use pivot, because you can't tell it how to treat the different values encountered during aggregation (count? max? mean? sum?). Note that the Index error is an error on the Index of the resulting pivoted table df1, not the original DataFrame df.

pivot_table lets you do it however with the aggfunc argument. How about something like this?

df1 = df.pivot_table(index = 'uid', columns = 'msg', values = '_time', aggfunc = len)

This will help you figure out which user-msg records have duplicate entries (anything with over 1), and after cleaning them out, you can use pivot on df to successfully pivot _time.

Upvotes: 8

fixxxer
fixxxer

Reputation: 16174

x is the data frame that you have as input :

    uid               msg   _time   gender  age
0   1   confirmed_settings  1/29/15 M       37
1   1   sale                4/13/15 M       37
2   3   confirmed_settings  4/19/15 M       35
3   4   confirmed_settings  2/21/15 M       21
4   5   confirmed_settings  3/28/15 M       18
5   4   sale                3/15/15 M       21

y = x.pivot(index='uid', columns='msg', values='_time')
x.join(y).drop('msg', axis=1)

gives you:

    uid _time   gender  age     confirmed_settings  sale
0   1   1/29/15     M   37                    NaN   NaN
1   1   4/13/15     M   37                1/29/15   4/13/15
2   3   4/19/15     M   35                    NaN   NaN
3   4   2/21/15     M   21                4/19/15   NaN
4   5   3/28/15     M   18                2/21/15   3/15/15
5   4   3/15/15     M   21                3/28/15   NaN

Upvotes: 4

Alexander
Alexander

Reputation: 109746

You can use groupby to aggregate by the common factors, take the max of time to get the most recent dates, and then unstack the msg to view confirmed_settings and sale side by side:

df.groupby(['uid', 'msg', 'gender', 'age']).time.max().unstack('msg')

msg            confirmed_settings     sale
uid gender age                            
1   M      37             1/29/15  4/13/15
3   M      35             4/19/15      NaN
4   M      21             2/21/15  3/15/15
5   M      18             3/28/15      NaN

Upvotes: 3

Related Questions