Reputation: 2304
How do you get a true outer join in pandas? Meaning that it actually gives you the entire output instead of combining the columns to merge on. This is kind of silly in my opinion because it then makes it hard to determine what kind of operation to do on a row. I do this all the time to detect whether I should Insert, Update, or Delete data, however I always have to create extra copies of the merge on columns, which is just a bunch of overhead (sometimes massive amounts) on certain data sets.
example:
import pandas as pd
keys = ["A","B"]
df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})
fullJoinDf = df1.merge(df2, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
display(
fullJoinDf,
)
A B C D C_r D_r
0 1 one testThis NaN NaN NaN
1 2 two testThat -3.656526e+18 testThis -9.136326e+18
2 3 three testThis -8.571400e+18 testThat -8.571400e+18
3 4 four NaN NaN testThis -4.190116e+17
Noticed how it output A
& B
magically combined to a single set of columns. What I want is what I would get in SQL outerjoins etc like:
A B C D A_r B_r C_r D_r
0 1 one testThis NaN NaN NaN NaN NaN
1 2 two testThat -3.656526e+18 2 two testThis -9.136326e+18
2 3 three testThis -8.571400e+18 3 three testThat -8.571400e+18
3 NaN NaN NaN NaN 4 four testThis -4.190116e+17
Edit for @Felipe Whitaker
Using concat:
df3 = df1.copy().set_index(keys)
df4 = df2.copy().set_index(keys)
t = pd.concat([df3,df4], axis=1)
t.reset_index(),
A B C D C D
0 1 one testThis NaN NaN NaN
1 2 two testThat -3.656526e+18 testThis -9.136326e+18
2 3 three testThis -8.571400e+18 testThat -8.571400e+18
3 4 four NaN NaN testThis -4.190116e+17
EDIT Examples* Given the answers I'm posting more tests, so anyone else who stumbles on this can see a few more of the "gatcha" variations I've discovered while doing this.
import pandas as pd
keys = ["A","B"]
df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})
df3 = df1.copy()
df4 = df2.copy()
df3.index = df3[keys]
df4.index = df4[keys]
df5 = df1.copy().set_index(keys)
df6 = df2.copy().set_index(keys)
fullJoinDf = df5.merge(df6, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
fullJoinDf_2 = df3.merge(df4, how="outer", left_index=True, right_index=True, suffixes=["","_r"])
t = pd.concat([df1,df2], axis=1, keys=["A","B"])
display(
df3.index,
df5.index,
fullJoinDf,
fullJoinDf_2,
t,
)
Index([(1, 'one'), (2, 'two'), (3, 'three')], dtype='object')
MultiIndex([(1, 'one'),
(2, 'two'),
(3, 'three')],
names=['A', 'B'])
A B C D C_r D_r
0 1 one testThis NaN NaN NaN
1 2 two testThat -3.656526e+18 testThis -9.136326e+18
2 3 three testThis -8.571400e+18 testThat -8.571400e+18
3 4 four NaN NaN testThis -4.190116e+17
A B C D A_r B_r C_r D_r
(1, one) 1.0 one testThis NaN NaN NaN NaN NaN
(2, two) 2.0 two testThat -3.656526e+18 2.0 two testThis -9.136326e+18
(3, three) 3.0 three testThis -8.571400e+18 3.0 three testThat -8.571400e+18
(4, four) NaN NaN NaN NaN 4.0 four testThis -4.190116e+17
A B C D A B C D
0 1 one testThis NaN 2 two testThis -9136325526401183790
1 2 two testThat -3.656526e+18 3 three testThat -8571400026927442160
2 3 three testThis -8.571400e+18 4 four testThis -419011572131270498
Upvotes: 0
Views: 221
Reputation: 2811
if you rename the columns used in the merge in 1 of the DataFrames before the merge
it looks like it will give the correct answer
df1.merge(df2.rename({'A': 'A_y', 'B': 'B_y'}, axis =1), left_on=keys, right_on=['A_y', 'B_y'], how='outer')
#output:
A B C_x D_x A_y B_y C_y D_y
0 1.0 one testThis NaN NaN NaN NaN NaN
1 2.0 two testThat -2.482945e+18 2.0 two testThis -1.215774e+18
2 3.0 three testThis 1.140152e+17 3.0 three testThat 1.140152e+17
3 NaN NaN NaN NaN 4.0 four testThis -4.915382e+18
Upvotes: 2
Reputation: 93151
If you don't care abut the original indexes at all:
df1.index = df1[keys]
df2.index = df2[keys]
fullJoinDf = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=["","_r"])
Result:
A B C D A_r B_r C_r D_r
0 1.0 one testThis NaN NaN NaN NaN NaN
1 2.0 two testThat 6.368540e+18 2.0 two testThis -6.457388e+18
2 3.0 three testThis -7.490461e+18 3.0 three testThat -7.490461e+18
3 NaN NaN NaN NaN 4.0 four testThis 4.344649e+18
Upvotes: 3