Self-Join in Pandas producing unwanted Duplicates

Question

I have data in a Pandas dataframe in the format:

CompanyA, CompanyB, Currency, Item, Amount

Typical rows might be:

Microsoft,Oracle,USD,Item_X,252.23
Microsoft,Oracle,USD,Item_Y,234.23
Microsoft,Oracle,EUR,Item_X,23352.00
Microsoft,Oracle,EUR,Item_Y,23344.80
Microsoft,IBM,GBP,Item_X,123.12
Microsoft,IBM,GBP,Item_Y,432.12
Oracle,IBM,EUR,Item_X,999.23
Oracle,IBM,EUR,Item_Y,234.23

Amount if a float, the others are strings.

I want to expand the Item column out so each Item entry gets its own column, which now contains the Amount. Basically make the data wider, rather than longer.

CompanyA, CompanyB, Currency, Item_X, Item_Y

Microsoft,Oracle,USD,252.23, 234.23
Microsoft,Oracle,EUR,,23352.00,23344.80
... and so on.

It feels like it should be a self-join - and I've tried things like:

df = pd.merge(df, df, on=['CompanyA', 'CompanyB', 'Currency'])

This produces almost the right output, but it joins each row 4 times:

Item_X -> Item_X
Item_X -> Item_Y
Item_Y -> Item_X
Item_Y -> Item_Y

Obviously I'm only interested in Item_X -> Item_Y.

In SQL you would further contrain the query, and this is where I'm getting stuck - how do this in Pandas? Or is there an easier approach!

Cheers!

Phil.

jezrael · Accepted Answer

I think you need set_index with unstack:

df = df.set_index(['CompanyA','CompanyB','Currency','Item'])['Amount']
       .unstack()
       .reset_index()

print (df)
Item   CompanyA CompanyB Currency    Item_X    Item_Y
0     Microsoft      IBM      GBP    123.12    432.12
1     Microsoft   Oracle      EUR  23352.00  23344.80
2     Microsoft   Oracle      USD    252.23    234.23
3        Oracle      IBM      EUR    999.23    234.23

Or if duplicates need pivot_table with aggregate function:

print (df)
    CompanyA CompanyB Currency    Item    Amount
0  Microsoft   Oracle      USD  Item_X    252.23
1  Microsoft   Oracle      USD  Item_Y    234.23
2  Microsoft   Oracle      EUR  Item_X  23352.00
3  Microsoft   Oracle      EUR  Item_Y  23344.80
4  Microsoft      IBM      GBP  Item_X    123.12
5  Microsoft      IBM      GBP  Item_Y    432.12
6     Oracle      IBM      EUR  Item_X    999.23
7     Oracle      IBM      EUR  Item_Y     10.00 <-same values, only Amount different
8     Oracle      IBM      EUR  Item_Y     20.00 <-same values, only Amount different


df = df.pivot_table(index=['CompanyA','CompanyB','Currency'],
                    columns='Item', 
                    values='Amount', 
                    aggfunc='mean').reset_index()
print (df)
Item   CompanyA CompanyB Currency    Item_X    Item_Y
0     Microsoft      IBM      GBP    123.12    432.12
1     Microsoft   Oracle      EUR  23352.00  23344.80
2     Microsoft   Oracle      USD    252.23    234.23
3        Oracle      IBM      EUR    999.23     15.00

Self-Join in Pandas producing unwanted Duplicates

Answers (1)

Related Questions