Ekaterina
Ekaterina

Reputation: 335

Reorganization Data Frame

I have a date frame of the following kind:

|user|product|quantity|
|1   |   1   |   4    | 
|1   |   1   |   3    |
|1   |   2   |   1    |
|2   |   1   |   5    |
|2   |   1   |   8    |
|2   |   5   |   7    |
 ...    ...      ...

I want to get the date frame of the next kind:

|user|product_1|product_2|...|product_5|...|product_n|
|1   |   7     |    1    |...|   0     |...|    0    | 
|2   |   13    |    0    |...|   7     |...|   ...   |
 ...     ...       ...    ...   ...     ...    ...

How can it be done quickly and efficiently on a very large amount of data?

Upvotes: 1

Views: 86

Answers (2)

Nico Albers
Nico Albers

Reputation: 1696

Try pd.DataFrame.unstack(). That does exactly what you want.

Update: I wanted to extend this answer, but due to a meeting too late now, the other answer is good. But, by now (assuming, you only have one number per user-product, so no need for groupby):

df = df.set_index(['user', 'product'])
df = df.unstack(-1).fillna(0)  # only if you want to fill the NaNs

Upvotes: 1

Nickil Maveli
Nickil Maveli

Reputation: 29729

Use pivot_table with aggfunc="sum" to reshape the dataframe from long format to a wide one. Post that, format the column names with the help of add_prefix option.

d = df.pivot_table(index='user', columns='product', 
                   values='quantity', aggfunc='sum', fill_value=0)
d.add_prefix("product_").rename_axis([None], axis=1)

enter image description here


It's equivalent groupby syntax would be computing the sum across the grouped columns and then unstacking the dataframe. Followed by similar operations as per the afore-mentioned approach.

d1 = df.groupby(['user', 'product'])['quantity'].sum()
d1.unstack(fill_value=0).add_prefix("product_").rename_axis([None], axis=1)

enter image description here

Upvotes: 2

Related Questions