Reputation: 23
I have a data set which is in a long format, and I need to convert it to wide format, but I first need to create the column names. This original dataset looks like this:
Item Date Cost
---- ----------- -----
1102 10-12-2011 $1.00
1102 10-18-2011 $8.00
1102 10-15-2011 $11.00
1103 10-16-2011 $3.00
1104 11-18-2011 $7.00
1104 11-15-2011 $5.00
And I want to transform the dataset to look like this:
Item charge_0 charge_1 charge_2
---- -------- -------- --------
1102 1.00 11.00 8.00
1103 3.00 NaN NaN
1104 5.00 7.00 NaN
The charge number (column names) indicates the order in which the charges were received for each item (earliest date first).
I've gotten some code to work, but it's really slow, and inelegant. The data set I'm working with is a couple million rows, and I need to use the code on a much larger dataset, so efficiency is important.
# Sort dataframe by Item and Date
df_sorted=df.sort_values['Item','Date'], ascending=[1,1])
df_sorted.reset_index(drop=True, inplace=True)
# Get the order or charges (based on date) for each item
df_cost=df_sorted.groupby('Item').apply(lambda x:x['Cost']
.reset_index()).reset_index()
df_cost['colName']='charge_' + df_cost['level_1'].astype(str)
# Transform data from long to wide format
df_long=df_cost.pivot(index='Item', columns='colName', values='Cost')
Any suggestions or advice would be greatly appreciated. Thanks, Nicole
Upvotes: 1
Views: 586
Reputation: 375485
You can get the charge number from cumcount:
In [11]: df["charge"] = df.groupby(["Item"]).cumcount()
In [12]: df.pivot_table("Cost", index="Item", columns="charge", aggfunc=lambda x: x)
Out[12]:
charge 0 1 2
Item
1102 $1.00 $8.00 $11.00
1103 $3.00 NaN NaN
1104 $7.00 $5.00 NaN
Upvotes: 1