Reputation: 642
I have data in a Pandas dataframe in the format:
CompanyA, CompanyB, Currency, Item, Amount
Typical rows might be:
Microsoft,Oracle,USD,Item_X,252.23
Microsoft,Oracle,USD,Item_Y,234.23
Microsoft,Oracle,EUR,Item_X,23352.00
Microsoft,Oracle,EUR,Item_Y,23344.80
Microsoft,IBM,GBP,Item_X,123.12
Microsoft,IBM,GBP,Item_Y,432.12
Oracle,IBM,EUR,Item_X,999.23
Oracle,IBM,EUR,Item_Y,234.23
Amount if a float, the others are strings.
I want to expand the Item column out so each Item entry gets its own column, which now contains the Amount. Basically make the data wider, rather than longer.
CompanyA, CompanyB, Currency, Item_X, Item_Y
Microsoft,Oracle,USD,252.23, 234.23
Microsoft,Oracle,EUR,,23352.00,23344.80
... and so on.
It feels like it should be a self-join - and I've tried things like:
df = pd.merge(df, df, on=['CompanyA', 'CompanyB', 'Currency'])
This produces almost the right output, but it joins each row 4 times:
Item_X -> Item_X
Item_X -> Item_Y
Item_Y -> Item_X
Item_Y -> Item_Y
Obviously I'm only interested in Item_X -> Item_Y.
In SQL you would further contrain the query, and this is where I'm getting stuck - how do this in Pandas? Or is there an easier approach!
Cheers!
Phil.
Upvotes: 2
Views: 928
Reputation: 862406
I think you need set_index
with unstack
:
df = df.set_index(['CompanyA','CompanyB','Currency','Item'])['Amount']
.unstack()
.reset_index()
print (df)
Item CompanyA CompanyB Currency Item_X Item_Y
0 Microsoft IBM GBP 123.12 432.12
1 Microsoft Oracle EUR 23352.00 23344.80
2 Microsoft Oracle USD 252.23 234.23
3 Oracle IBM EUR 999.23 234.23
Or if duplicates need pivot_table
with aggregate function:
print (df)
CompanyA CompanyB Currency Item Amount
0 Microsoft Oracle USD Item_X 252.23
1 Microsoft Oracle USD Item_Y 234.23
2 Microsoft Oracle EUR Item_X 23352.00
3 Microsoft Oracle EUR Item_Y 23344.80
4 Microsoft IBM GBP Item_X 123.12
5 Microsoft IBM GBP Item_Y 432.12
6 Oracle IBM EUR Item_X 999.23
7 Oracle IBM EUR Item_Y 10.00 <-same values, only Amount different
8 Oracle IBM EUR Item_Y 20.00 <-same values, only Amount different
df = df.pivot_table(index=['CompanyA','CompanyB','Currency'],
columns='Item',
values='Amount',
aggfunc='mean').reset_index()
print (df)
Item CompanyA CompanyB Currency Item_X Item_Y
0 Microsoft IBM GBP 123.12 432.12
1 Microsoft Oracle EUR 23352.00 23344.80
2 Microsoft Oracle USD 252.23 234.23
3 Oracle IBM EUR 999.23 15.00
Upvotes: 2