Reputation: 335
I have a date frame of the following kind:
|user|product|quantity|
|1 | 1 | 4 |
|1 | 1 | 3 |
|1 | 2 | 1 |
|2 | 1 | 5 |
|2 | 1 | 8 |
|2 | 5 | 7 |
... ... ...
I want to get the date frame of the next kind:
|user|product_1|product_2|...|product_5|...|product_n|
|1 | 7 | 1 |...| 0 |...| 0 |
|2 | 13 | 0 |...| 7 |...| ... |
... ... ... ... ... ... ...
How can it be done quickly and efficiently on a very large amount of data?
Upvotes: 1
Views: 86
Reputation: 1696
Try pd.DataFrame.unstack()
. That does exactly what you want.
Update: I wanted to extend this answer, but due to a meeting too late now, the other answer is good. But, by now (assuming, you only have one number per user-product, so no need for groupby):
df = df.set_index(['user', 'product'])
df = df.unstack(-1).fillna(0) # only if you want to fill the NaNs
Upvotes: 1
Reputation: 29729
Use pivot_table
with aggfunc="sum"
to reshape the dataframe from long format to a wide one. Post that, format the column names with the help of add_prefix
option.
d = df.pivot_table(index='user', columns='product',
values='quantity', aggfunc='sum', fill_value=0)
d.add_prefix("product_").rename_axis([None], axis=1)
It's equivalent groupby
syntax would be computing the sum across the grouped columns and then unstacking the dataframe. Followed by similar operations as per the afore-mentioned approach.
d1 = df.groupby(['user', 'product'])['quantity'].sum()
d1.unstack(fill_value=0).add_prefix("product_").rename_axis([None], axis=1)
Upvotes: 2