Reputation: 97
I have two dataframes that I want to merge, but my key column contains duplicates. The dataframes look like this:
Name,amount,id
John,500.25,GH10
Helen,1250.00,GH11
Adam,432.54,GH11
Sarah,567.12,GH12
Category,amount,id
Food,500.25,GH10
Travel,1250.00,GH11
Food,432.54,GH11
I'm merging them with an outer join to include everything in the merged table:
merged_table = pd.merge(df1,df2,on="id",how='outer')
And my output is:
Name,amount_x,id,Category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Helen,1250.00,GH11,Food,432.54
Adam,432.54,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12
However, my desired output is:
Name,amount_x,id,Category,amount_y
John,500.25,GH10,Food,500.25
Helen,1250.00,GH11,Travel,1250.00
Adam,432.54,GH11,Food,432.54
Sarah,567.12,GH12
So what's happening here is that each record with a duplicated key is matched against every record in the other table, so the output has 4 rows instead of 2, and the two in the middle (rows 2 and 3) are unwanted.
The solutions that come to my mind:
Somehow preventing the creation of duplicated rows. I can't use drop_duplicates() before the merge, because then I would exclude some of the rows with a doubled key. The other column, amount, should have the same 2 values in both tables, but there is a very small possibility that they may differ.
Merging the same way I'm doing it now, but then dropping rows 2 and 3 and keeping rows 1 and 4 when the ID is duplicated. Since matching pairs the first row in df1 with the first row in df2, then with the second row in df2, and then pairs the second row in df1 with the first and second rows in df2, rows 1 and 4 are the correct ones.
I'm thinking here of using .apply() and writing some lambda function, but I can't really wrap my head around how it should be written.
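For reference, the many-to-many expansion described above can be reproduced with a minimal sketch (toy frames with only the duplicated key):

```python
import pandas as pd

# Two rows share the key GH11 in each frame, so the merge pairs every
# left GH11 row with every right GH11 row (2 x 2 = 4 rows for that key).
df1 = pd.DataFrame({"Name": ["Helen", "Adam"], "id": ["GH11", "GH11"]})
df2 = pd.DataFrame({"Category": ["Travel", "Food"], "id": ["GH11", "GH11"]})
merged = pd.merge(df1, df2, on="id", how="outer")
print(len(merged))  # 4 rows: each Name matched with each Category
```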
Upvotes: 7
Views: 5663
Reputation: 635
On your output (after the merge) you can apply the below. It can also be done in a single shot, but I suggest you figure that out yourself. Giving you the hint...
>>> df.drop_duplicates('Name',keep='first')
Name amount_x id category amount_y
0 John 500.25 GH10 Food 500.25
1 Helen 1250.00 GH11 Travel 1250
3 Adam 432.54 GH11 Travel 1250
5 Sarah 567.12 GH12
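A sketch of the single-shot version hinted at above, assuming df1/df2 are built from the data in the question (note that keep='first' retains row 3, Adam/Travel, exactly as in the output shown):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Helen", "Adam", "Sarah"],
    "amount": [500.25, 1250.00, 432.54, 567.12],
    "id": ["GH10", "GH11", "GH11", "GH12"],
})
df2 = pd.DataFrame({
    "Category": ["Food", "Travel", "Food"],
    "amount": [500.25, 1250.00, 432.54],
    "id": ["GH10", "GH11", "GH11"],
})

# Merge, then drop duplicate Names in one chain.
out = (pd.merge(df1, df2, on="id", how="outer")
         .drop_duplicates("Name", keep="first"))
print(out)
```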
Upvotes: 1
Reputation: 863351
I suggest creating a new helper column that numbers the id values with cumcount, and then merging on it as well:
df1['g'] = df1.groupby('id').cumcount()
df2['g'] = df2.groupby('id').cumcount()
merged_table = pd.merge(df1,df2,on=["id", 'g'],how='outer')
print (merged_table)
Name amount_x id g Category amount_y
0 John 500.25 GH10 0 Food 500.25
1 Helen 1250.00 GH11 0 Travel 1250.00
2 Adam 432.54 GH11 1 Food 432.54
3 Sarah 567.12 GH12 0 NaN NaN
And last, remove the helper column g:
merged_table = pd.merge(df1,df2,on=["id", 'g'],how='outer').drop('g', axis=1)
print (merged_table)
Name amount_x id Category amount_y
0 John 500.25 GH10 Food 500.25
1 Helen 1250.00 GH11 Travel 1250.00
2 Adam 432.54 GH11 Food 432.54
3 Sarah 567.12 GH12 NaN NaN
Detail:
print (df1)
Name amount id g
0 John 500.25 GH10 0
1 Helen 1250.00 GH11 0
2 Adam 432.54 GH11 1
3 Sarah 567.12 GH12 0
print (df2)
Category amount id g
0 Food 500.25 GH10 0
1 Travel 1250.00 GH11 0
2 Food 432.54 GH11 1
Upvotes: 14
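Putting the answer together as one self-contained script, with the dataframes reconstructed from the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Helen", "Adam", "Sarah"],
    "amount": [500.25, 1250.00, 432.54, 567.12],
    "id": ["GH10", "GH11", "GH11", "GH12"],
})
df2 = pd.DataFrame({
    "Category": ["Food", "Travel", "Food"],
    "amount": [500.25, 1250.00, 432.54],
    "id": ["GH10", "GH11", "GH11"],
})

# Number each repeat of an id within its frame (0, 1, 2, ...), so that
# merging on (id, g) pairs the first GH11 row in df1 only with the
# first GH11 row in df2, the second with the second, and so on.
df1["g"] = df1.groupby("id").cumcount()
df2["g"] = df2.groupby("id").cumcount()

merged_table = (pd.merge(df1, df2, on=["id", "g"], how="outer")
                  .drop("g", axis=1))
print(merged_table)
```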