jakes
jakes

Reputation: 2085

Matches not found by pd.DataFrame.merge

I've got a three pd.DataFrames:

df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})

df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})

df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})

I need to stack df1 and df2 together, then join them by left join with df3 based on multiple variables: var1, var2, var5.

So I wrote:

pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])

but it doesn't find all the matching rows. Changing the type to outer join we can observe there's is for example two rows with the same values of var1, var2 and var3 - rows 11th and 28th, but they haven't been joined:

pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'outer', on = ['var1', 'var2', 'var5'])

I'm struggling to find a reason for that behaviour. I thought maybe data types are different within joining columns, but no - they are the same. I'm relatively new to Pandas, so maybe I'm missing something obvious here? What is the reason for that (unexpected) behaviour?

Upvotes: 0

Views: 49

Answers (2)

braintho
braintho

Reputation: 401

df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})

df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})

df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})

pd.concat([df1, df2], axis = 0).dtypes

results in

var1     int64
var2     int64
var3     int64
var4     int64
var5    object
dtype: object

As you can see after the concat the var5 is an object. If you merge at this point you will get no results as var5 in df3 is a float.

Here is what I would recommend:

df1['var5'] = df1['var5'].astype(float)
df2['var5'] = df2['var5'].astype(float)
df3['var5'] = df3['var5'].astype(float)
pd.concat([df1, df2], axis = 0).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])

This will produce the following DataFrame:

    var1  var2  var3  var4      var5  var6  var7
0   2210     1     0     0  121160.0   NaN   NaN
1   2210     2     0     0   20066.0   NaN   NaN
2   2210     1     0     0   58621.0   NaN   NaN
3   2210     2     0     0  201084.0   NaN   NaN
4   2210     1     0     0  100180.0   NaN   NaN
5   2210     2     0     0   74230.0   NaN   NaN
6   2210     1     0     0   27789.0   NaN   NaN
7   2210     2     0     0   66975.0   NaN   NaN
8   2210     1     0     0   57410.0   NaN   NaN
9   2210     2     0     0   49413.0   NaN   NaN
10  2210     1     0     0   57112.0   NaN   NaN
11  2210     2     0     0   19188.0  19.0   8.0
12  2210     1     0     0   61366.0  95.0  40.0
13  2210     2     0     0   27341.0  75.0   0.0
14  2210     1     0     0   59859.0  56.0  42.0
15  2210     2     0     0  173954.0   NaN   NaN
16  2210     1     0     0  205651.0  22.0  52.0
17  2210     2     0     0   54861.0   NaN   NaN
18  2210     1     0     0  165809.0  80.0  50.0
19  2210     2     0     0   60252.0   NaN   NaN
20  2210     1     0     0  182156.0   NaN   NaN
21  2210     2     0     0   82403.0   NaN   NaN
22  2210     2    19     8   58786.0   NaN   NaN
23  2210     1    22    52   37572.0   NaN   NaN
24  2210     1    56    42  103955.0   NaN   NaN
25  2210     2    75     0   19197.0  95.0  17.0
26  2210     1    80    50   14664.0   NaN   NaN
27  2210     2    95    17   75773.0   NaN   NaN

Upvotes: 1

pslessard
pslessard

Reputation: 71

When I ran your code on my computer, then used df#.dtypes to get the types, the dtype of the var5 column in df1 is object, whereas in df2 and df3 it's float64. The concat runs fine with this (and after the concat, the dtype is object), but when I tried to run the merge (outer or left), I got a ValueError:

ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat

I'd suggest double checking the types again (I know you already checked that). If they really are the same on your computer, I'm not sure what's going on.

Upvotes: 0

Related Questions