Jamie Marshall
Jamie Marshall

Reputation: 2304

pandas true outer join?

How do you get a true outer join in pandas? Meaning that it actually gives you the entire output instead of combining the columns to merge on. This is kind of silly in my opinion because it then makes it hard to determine what kind of operation to do on a row. I do this all the time to detect whether I should Insert, Update, or Delete data, however I always have to create extra copies of the merge on columns, which is just a bunch of overhead (sometimes massive amounts) on certain data sets.

example:

import pandas as pd

keys = ["A","B"]

df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})

fullJoinDf = df1.merge(df2, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
display(
    fullJoinDf,
)

    A   B       C           D               C_r          D_r
0   1   one     testThis    NaN             NaN          NaN
1   2   two     testThat    -3.656526e+18   testThis    -9.136326e+18
2   3   three   testThis    -8.571400e+18   testThat    -8.571400e+18
3   4   four    NaN         NaN             testThis    -4.190116e+17

Noticed how it output A & B magically combined to a single set of columns. What I want is what I would get in SQL outerjoins etc like:

    A    B      C           D               A_r  B_r     C_r        D_r
0   1    one    testThis    NaN             NaN  NaN     NaN        NaN     
1   2    two    testThat    -3.656526e+18   2    two     testThis   -9.136326e+18
2   3    three  testThis    -8.571400e+18   3    three   testThat   -8.571400e+18
3   NaN  NaN    NaN         NaN             4    four    testThis   -4.190116e+17

Edit for @Felipe Whitaker

Using concat:

df3 = df1.copy().set_index(keys)
df4 = df2.copy().set_index(keys)
t = pd.concat([df3,df4], axis=1)
t.reset_index(), 

    A   B       C           D               C           D
0   1   one     testThis    NaN             NaN         NaN
1   2   two     testThat    -3.656526e+18   testThis    -9.136326e+18
2   3   three   testThis    -8.571400e+18   testThat    -8.571400e+18
3   4   four    NaN         NaN             testThis    -4.190116e+17

EDIT Examples* Given the answers I'm posting more tests, so anyone else who stumbles on this can see a few more of the "gatcha" variations I've discovered while doing this.

import pandas as pd

keys = ["A","B"]

df1 = pd.DataFrame({"A":[1,2,3],"B":["one","two","three"],"C":["testThis","testThat", "testThis"],"D":[None,hash("B"),hash("C")]})
df2 = pd.DataFrame({"A":[2,3,4],"B":["two","three","four"],"C":["testThis","testThat", "testThis"], "D":[hash("G"),hash("C"),hash("D")]})

df3 = df1.copy()
df4 = df2.copy()
df3.index = df3[keys]
df4.index = df4[keys]

df5 = df1.copy().set_index(keys)
df6 = df2.copy().set_index(keys)


fullJoinDf = df5.merge(df6, how="outer", left_on=keys, right_on=keys, suffixes=["","_r"])
fullJoinDf_2 = df3.merge(df4, how="outer", left_index=True, right_index=True, suffixes=["","_r"])
t = pd.concat([df1,df2], axis=1, keys=["A","B"])
display(
    df3.index,
    df5.index,
    fullJoinDf,
    fullJoinDf_2,
    t,
)

Index([(1, 'one'), (2, 'two'), (3, 'three')], dtype='object')
MultiIndex([(1,   'one'),
            (2,   'two'),
            (3, 'three')],
           names=['A', 'B'])

    A   B       C           D               C_r         D_r
0   1   one     testThis    NaN             NaN         NaN
1   2   two     testThat    -3.656526e+18   testThis    -9.136326e+18
2   3   three   testThis    -8.571400e+18   testThat    -8.571400e+18
3   4   four    NaN         NaN             testThis    -4.190116e+17

            A    B      C           D               A_r  B_r    C_r        D_r
(1, one)    1.0  one    testThis    NaN             NaN  NaN    NaN        NaN
(2, two)    2.0  two    testThat    -3.656526e+18   2.0  two    testThis    -9.136326e+18
(3, three)  3.0  three  testThis    -8.571400e+18   3.0  three  testThat    -8.571400e+18
(4, four)   NaN  NaN    NaN         NaN             4.0  four   testThis    -4.190116e+17

    A   B       C           D               A   B       C           D
0   1   one     testThis    NaN             2   two     testThis    -9136325526401183790
1   2   two     testThat    -3.656526e+18   3   three   testThat    -8571400026927442160
2   3   three   testThis    -8.571400e+18   4   four    testThis    -419011572131270498

Upvotes: 0

Views: 221

Answers (2)

Terry
Terry

Reputation: 2811

if you rename the columns used in the merge in 1 of the DataFrames before the merge it looks like it will give the correct answer

df1.merge(df2.rename({'A': 'A_y', 'B': 'B_y'}, axis =1), left_on=keys, right_on=['A_y', 'B_y'], how='outer')
#output:
    A   B       C_x         D_x             A_y     B_y     C_y         D_y
0   1.0 one     testThis    NaN             NaN     NaN     NaN         NaN
1   2.0 two     testThat    -2.482945e+18   2.0     two     testThis    -1.215774e+18
2   3.0 three   testThis    1.140152e+17    3.0     three   testThat    1.140152e+17
3   NaN NaN     NaN         NaN             4.0     four    testThis    -4.915382e+18

Upvotes: 2

Code Different
Code Different

Reputation: 93151

If you don't care abut the original indexes at all:

df1.index = df1[keys]
df2.index = df2[keys]

fullJoinDf = df1.merge(df2, how="outer", left_index=True, right_index=True, suffixes=["","_r"])

Result:

     A      B         C             D  A_r    B_r       C_r           D_r
0  1.0    one  testThis           NaN  NaN    NaN       NaN           NaN
1  2.0    two  testThat  6.368540e+18  2.0    two  testThis -6.457388e+18
2  3.0  three  testThis -7.490461e+18  3.0  three  testThat -7.490461e+18
3  NaN    NaN       NaN           NaN  4.0   four  testThis  4.344649e+18

Upvotes: 3

Related Questions