Reputation: 103

Why merging 2 data frames gives me one with triple the rows

I have df1:

       x            y no.
0  -17.7    -0.785430  y1
1  -15.0 -3820.085000  y4
2  -12.5     2.138833  y3
..  ....     ........  ..
40  15.6     5.486901  y2
41  19.2     1.980686  y3
42  19.6     9.364718  y2

and df2:

       delta y     x
0     0.053884 -17.7
1     0.085000 -15.0
2     0.143237 -12.5
..    ........  ....
40    0.113099  15.6
41    0.102245  19.2
42    0.235282  19.6

They both have 43 rows, and x column is exactly the same on both.

Somehow when I merge them on x I get a df with 123 rows:

 x            y no.   delta y
0   -17.7    -0.785430  y1  0.053884
1   -15.0 -3820.085000  y4  0.085000
2   -12.5     2.138833  y3  0.143237
3   -12.4     1.721205  y3  0.251180
4   -12.1     2.227343  y2  0.127343
..    ...          ...  ..       ...
118  12.1     1.642526  y3  0.143886
119  14.4  2576.435000  y4  0.171000
120  15.6     5.486901  y2  0.113099
121  19.2     1.980686  y3  0.102245
122  19.6     9.364718  y2  0.235282

My input: final = df1.merge(df2, on="x") x float64 y float64 no. object dtype: object

delta y float64 x float64 dtype: object

x float64 y float64 no. object dtype: object

delta y float64 x float64 dtype: object

x float64 y float64 no. object dtype: object

delta y float64 x float64 dtype: object

df1 = pd.DataFrame({'x': {0: -17.7, 1: -15.0, 2: -12.5, 3: -12.4, 4: -12.1, 5: -11.2, 6: -8.9, 7: -7.5, 8: -7.5, 9: -6.0, 10: -6.0, 11: -4.7, 12: -4.1, 13: -3.8, 14: -3.4, 15: -3.4, 16: -1.9, 17: -1.5, 18: -1.1, 19: -0.4, 20: -0.1, 21: 3.5, 22: 3.8, 23: 5.3, 24: 5.3, 25: 5.3, 26: 5.3, 27: 5.3, 28: 5.3, 29: 5.3, 30: 5.3, 31: 5.3, 32: 6.4, 33: 6.8, 34: 6.8, 35: 10.2, 36: 10.3, 37: 11.9, 38: 12.1, 39: 14.4, 40: 15.6, 41: 19.2, 42: 19.6}, 'y': {0: -0.7854295, 1: -3820.085, 2: 2.1388333, 3: 1.7212046, 4: 2.227343, 5: 0.04315967, 6: -0.9616607, 7: -1.9878536, 8: -0.52237016, 9: -283.27216, 10: -282.5332, 11: -0.4335017, 12: -1.1585577, 13: -0.008831219, 14: 848.92303, 15: -57.407845, 16: -9.010686, 17: -3.2473037, 18: 0.5536767, 19: 1.8351307, 20: 4.8347697, 21: -6.45842, 22: -1.5683812, 23: 0.9338831, 24: 0.9338831, 25: 97.65833, 26: 1.6500127, 27: 1.6500127, 28: 97.65833, 29: 97.65833, 30: 1.6500127, 31: 97.65833, 32: -3.655422, 33: 1.9058462, 34: 227.5592, 35: 857.7455, 36: -0.68584794, 37: 1.6785516, 38: 1.6425261, 39: 2576.435, 40: 5.4869013, 41: 1.9806856, 42: 9.364718}, 'no.': {0: 'y1', 1: 'y4', 2: 'y3', 3: 'y3', 4: 'y2', 5: 'y3', 6: 'y2', 7: 'y2', 8: 'y2', 9: 'y4', 10: 'y4', 11: 'y1', 12: 'y3', 13: 'y1', 14: 'y4', 15: 'y4', 16: 'y4', 17: 'y4', 18: 'y1', 19: 'y3', 20: 'y4', 21: 'y2', 22: 'y3', 23: 'y3', 24: 'y3', 25: 'y4', 26: 'y3', 27: 'y3', 28: 'y4', 29: 'y3', 30: 'y4', 31: 'y4', 32: 'y2', 33: 'y3', 34: 'y3', 35: 'y4', 36: 'y3', 37: 'y3', 38: 'y3', 39: 'y4', 40: 'y2', 41: 'y3', 42: 'y2'}})

df2 = pd.DataFrame({'delta y': {0: 0.05388353000000001, 1: 0.08500000000003638, 2: 0.14323679999999994, 3: 0.25117999999999996, 4: 0.12734299999999976, 5: 0.36285006000000003, 6: 0.13833930000000005, 7: 0.5121464, 8: 1.97762984, 9: 0.2721599999999853, 10: 0.4667999999999779, 11: 0.2692114, 12: 0.00890970000000002, 13: 0.314458351, 14: 906.34703, 15: 0.0161549999999977, 16: 0.06831400000000087, 17: 0.3723036999999998, 18: 0.2988478, 19: 0.006991300000000145, 20: 0.14423030000000026, 21: 0.04157999999999973, 22: 0.013554200000000183, 23: 0.17486560000000007, 24: 0.17486560000000007, 25: 0.03866999999999621, 26: 0.541264, 27: 0.541264, 28: 0.03866999999999621, 29: 96.5495813, 30: 96.0469873, 31: 0.03866999999999621, 32: 0.05542200000000008, 33: 0.1670513, 34: 225.82040510000002, 35: 0.38250000000005, 36: 0.59580486, 37: 0.10641100000000003, 38: 0.14388610000000002, 39: 0.17099999999982174, 40: 0.11309869999999922, 41: 0.10224489999999986, 42: 0.23528199999999977}, 'x': {0: -17.7, 1: -15.0, 2: -12.5, 3: -12.4, 4: -12.1, 5: -11.2, 6: -8.9, 7: -7.5, 8: -7.5, 9: -6.0, 10: -6.0, 11: -4.7, 12: -4.1, 13: -3.8, 14: -3.4, 15: -3.4, 16: -1.9, 17: -1.5, 18: -1.1, 19: -0.4, 20: -0.1, 21: 3.5, 22: 3.8, 23: 5.3, 24: 5.3, 25: 5.3, 26: 5.3, 27: 5.3, 28: 5.3, 29: 5.3, 30: 5.3, 31: 5.3, 32: 6.4, 33: 6.8, 34: 6.8, 35: 10.2, 36: 10.3, 37: 11.9, 38: 12.1, 39: 14.4, 40: 15.6, 41: 19.2, 42: 19.6}})

final = df1.merge(df2, on="x")

Upvotes: 0

Answers (3)

tdelaney

Reputation: 77367

The problem is that x values are not unique, so the merge duplicates rows to get all of the combinations. In a simple example

>>> import pandas as pd
>>> df1=pd.DataFrame({"a":[1,2,3,2], "b":['a', 'b', 'c', 'd']})
>>> df2=pd.DataFrame({"a":[1,2,3,2], "c":['aa', 'bb', 'cc', 'dd']})
>>> df1.merge(df2, on='a')
   a  b   c
0  1  a  aa
1  2  b  bb
2  2  b  dd
3  2  d  bb
4  2  d  dd
5  3  c  cc

2 is not unique in the column and gets all of the combinations (notice b --> dd and d --> dd).

In your case, the x column is identical in the two dataframes. This would also mean that indexes haven't changed and you could assign the columns you want to df1.

df1["delta y"] = df2["delta y"]

Upvotes: 1

Montreal

Reputation: 15

Try the following syntax and I encourage you to thoroughly read the official documentation of python, the link is given at the bottom. I think you might have different x values in df1 and df2 and they are not 100% identical. This could be perhaps because of the decimals.

import pandas as pd

left = pd.DataFrame(
   {
       "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
     }
 )


right = pd.DataFrame(
     {
        "key": ["K0", "K1", "K2", "K3"],
         "C": ["C0", "C1", "C2", "C3"],
         "D": ["D0", "D1", "D2", "D3"],
     }
 )
 

result = pd.merge(left, right, on="key")

Result Image

Python Merge,Join, Concatenate Official Guide

Upvotes: 1