Reputation: 2085
I've got a three pd.DataFrames
:
df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})
df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})
df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})
I need to stack df1
and df2
together, then join them by left join with df3
based on multiple variables: var1, var2, var5
.
So I wrote:
pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])
but it doesn't find all the matching rows. Changing the type to outer join we can observe there's is for example two rows with the same values of var1, var2
and var3
- rows 11th and 28th, but they haven't been joined:
pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'outer', on = ['var1', 'var2', 'var5'])
I'm struggling to find a reason for that behaviour. I thought maybe data types are different within joining columns, but no - they are the same. I'm relatively new to Pandas, so maybe I'm missing something obvious here? What is the reason for that (unexpected) behaviour?
Upvotes: 0
Views: 49
Reputation: 401
df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})
df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})
df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})
pd.concat([df1, df2], axis = 0).dtypes
results in
var1 int64
var2 int64
var3 int64
var4 int64
var5 object
dtype: object
As you can see after the concat the var5 is an object. If you merge at this point you will get no results as var5 in df3 is a float.
Here is what I would recommend:
df1['var5'] = df1['var5'].astype(float)
df2['var5'] = df2['var5'].astype(float)
df3['var5'] = df3['var5'].astype(float)
pd.concat([df1, df2], axis = 0).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])
This will produce the following DataFrame:
var1 var2 var3 var4 var5 var6 var7
0 2210 1 0 0 121160.0 NaN NaN
1 2210 2 0 0 20066.0 NaN NaN
2 2210 1 0 0 58621.0 NaN NaN
3 2210 2 0 0 201084.0 NaN NaN
4 2210 1 0 0 100180.0 NaN NaN
5 2210 2 0 0 74230.0 NaN NaN
6 2210 1 0 0 27789.0 NaN NaN
7 2210 2 0 0 66975.0 NaN NaN
8 2210 1 0 0 57410.0 NaN NaN
9 2210 2 0 0 49413.0 NaN NaN
10 2210 1 0 0 57112.0 NaN NaN
11 2210 2 0 0 19188.0 19.0 8.0
12 2210 1 0 0 61366.0 95.0 40.0
13 2210 2 0 0 27341.0 75.0 0.0
14 2210 1 0 0 59859.0 56.0 42.0
15 2210 2 0 0 173954.0 NaN NaN
16 2210 1 0 0 205651.0 22.0 52.0
17 2210 2 0 0 54861.0 NaN NaN
18 2210 1 0 0 165809.0 80.0 50.0
19 2210 2 0 0 60252.0 NaN NaN
20 2210 1 0 0 182156.0 NaN NaN
21 2210 2 0 0 82403.0 NaN NaN
22 2210 2 19 8 58786.0 NaN NaN
23 2210 1 22 52 37572.0 NaN NaN
24 2210 1 56 42 103955.0 NaN NaN
25 2210 2 75 0 19197.0 95.0 17.0
26 2210 1 80 50 14664.0 NaN NaN
27 2210 2 95 17 75773.0 NaN NaN
Upvotes: 1
Reputation: 71
When I ran your code on my computer, then used df#.dtypes
to get the types, the dtype of the var5
column in df1
is object
, whereas in df2
and df3
it's float64
. The concat runs fine with this (and after the concat, the dtype is object
), but when I tried to run the merge (outer or left), I got a ValueError:
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
I'd suggest double checking the types again (I know you already checked that). If they really are the same on your computer, I'm not sure what's going on.
Upvotes: 0