kilgoretrout
kilgoretrout

Reputation: 3657

Python: pivot a pandas DataFrame when the desired index Series has duplicates

I have a pandas DataFrame my_data that looks like

    event_id    user_id    attended
0     13          345         1
1     14          654         0
...

So event_id and user_id both have duplicates because there is an entry for each user and event combination. What I want to do is reshape this into a DataFrame where my indices (rows) are the DISTINCT user_id's, the columns are the DISTINCT event_id's and the values in a given (row, col) is just the boolean 0 or 1 of whether they attended.

It seems that the pivot method is appropriate but of course when I tried my_data.pivot(index='user_id', columns='event_id', values='attended') I got the error that the index has duplicates.

I was thinking I should do some kind of groupby on the user_id's first but I don't want to add up all the attended 1's and 0's for each user because I specifically want to separate the event_id's as my new columns and keep separate which event was attended by each user.

Any help would be greatly appreciated, thanks!

Upvotes: 2

Views: 401

Answers (1)

DSM
DSM

Reputation: 353359

IIUC, pivot_table should give you what you want:

>>> df = pd.DataFrame({"event_id": np.random.randint(10, 20, 20), "user_id": np.random.randint(100, 110, 20), "attended": np.random.randint(0, 2, 20)})
>>> df.pivot_table(index="user_id", columns="event_id", values="attended", 
    aggfunc=sum).fillna(0)
event_id  10  11  12  13  14  15  16  17  19
user_id                                     
101        0   0   0   1   0   0   0   0   0
103        0   0   0   0   0   0   0   0   0
104        0   0   0   0   0   0   0   0   1
105        0   0   0   0   0   0   0   0   0
106        0   0   0   0   0   0   1   0   0
107        1   0   0   0   0   0   0   1   0
108        0   0   0   1   0   0   0   0   0
109        0   0   0   0   1   0   1   0   0

As written, if there are multiple rows with the same user/event combination (which probably isn't the case) the attendance will be summed. It's easy enough to use any or clip the values instead if you want to guarantee the frame consists only of 0s and 1s.

Upvotes: 3

Related Questions