Reputation: 3657
I have a pandas DataFrame my_data
that looks like
event_id user_id attended
0 13 345 1
1 14 654 0
...
So event_id
and user_id
both have duplicates because there is an entry for each user and event combination. What I want to do is reshape this into a DataFrame where my indices (rows) are the DISTINCT user_id
's, the columns are the DISTINCT event_id
's and the values in a given (row, col) is just the boolean 0 or 1 of whether they attended.
It seems that the pivot
method is appropriate but of course when I tried my_data.pivot(index='user_id', columns='event_id', values='attended')
I got the error that the index has duplicates.
I was thinking I should do some kind of groupby
on the user_id
's first but I don't want to add up all the attended
1's and 0's for each user because I specifically want to separate the event_id
's as my new columns and keep separate which event was attended by each user.
Any help would be greatly appreciated, thanks!
Upvotes: 2
Views: 401
Reputation: 353359
IIUC, pivot_table
should give you what you want:
>>> df = pd.DataFrame({"event_id": np.random.randint(10, 20, 20), "user_id": np.random.randint(100, 110, 20), "attended": np.random.randint(0, 2, 20)})
>>> df.pivot_table(index="user_id", columns="event_id", values="attended",
aggfunc=sum).fillna(0)
event_id 10 11 12 13 14 15 16 17 19
user_id
101 0 0 0 1 0 0 0 0 0
103 0 0 0 0 0 0 0 0 0
104 0 0 0 0 0 0 0 0 1
105 0 0 0 0 0 0 0 0 0
106 0 0 0 0 0 0 1 0 0
107 1 0 0 0 0 0 0 1 0
108 0 0 0 1 0 0 0 0 0
109 0 0 0 0 1 0 1 0 0
As written, if there are multiple rows with the same user/event combination (which probably isn't the case) the attendance will be summed. It's easy enough to use any
or clip the values instead if you want to guarantee the frame consists only of 0s and 1s.
Upvotes: 3