StatguyUser
StatguyUser

Reputation: 2665

Pandas left outer join exclusion

How can I do a left outer join, excluding the intersection, in Pandas?

I have 2 pandas dataframes

df1 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'accounting', 'IT'], 'col2' : ['az', 'bh', '', '', '']}) 
df2 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'finance', 'finance'], 'col2' : ['', 'az', '', '', '']})

df1

    col1    col2
0   finance az
1   finance bh
2   finance 
3   accounting  
4   IT  

df2

    col1    col2
0   finance 
1   finance az
2   finance 
3   finance 
4   finance 

As you can see the dataframe has blank values as well. I tried using the example and its not giving me the result I want.

common = df1.merge(df2,on=['col1','col2'])
df3=df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]

I want the output to look something like

    col1    col2
3   accounting  
4   IT  

Upvotes: 8

Views: 16169

Answers (3)

Aidan Wood
Aidan Wood

Reputation: 53

A one liner for this based on Bin's answer may be:

df=pd.merge(df1,df2[['col1']],on=['col1'],how="outer",indicator=True).query('_merge=="left_only"')

Upvotes: 4

Bin
Bin

Reputation: 3834

Pandas left outer join exclusion can be achieved by setting pandas merge's indicator=True. Then filter by the indicator in _merge column.

df=pd.merge(df1,df2[['col1']],on=['col1'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']
# this following line is just formating
df = df.reset_index()[['col1', 'col2']] 

Output:

col1    col2
0   accounting  
1   IT  

==================================

====The following is an example showing the mechanism====

df1 = pd.DataFrame({'key1': ['0', '1'],
                     'key2': [-1, -1],
                     'A': ['A0', 'A1'],
                     })


df2 = pd.DataFrame({'key1': ['0', '1'],
                      'key2': [1, -1], 
                    'B': ['B0', 'B1']
                     })

:

df1

Output:

    A   key1    key2
0   A0  0       -1
1   A1  1       -1

:

df2

Output:

    B   key1    key2
0   B0  0       1
1   B1  1       -1

:

df=pd.merge(df1,df2,on=['key1','key2'],how="outer",indicator=True)

:

Output:

     A  key1    key2    B   _merge
0   A0  0   -1  NaN left_only
1   A1  1   -1  B1  both
2   NaN 0   1   B0  right_only

:With the above indicators in the _merge column. you can select rows in one dataframe but not in another.

df=df[df['_merge']=='left_only']
df

Output:

    A   key1    key2    B   _merge
0   A0  0   -1  NaN left_only

Upvotes: 14

EFT
EFT

Reputation: 2369

This fails because you're independently checking for a match in col1 & col2, and excluding a match on either. The empty strings match the empty strings in the finance rows.

You'd want:

df3 = df1[(~df1.col1.isin(common.col1))|(~df1.col2.isin(common.col2))]
df3
Out[150]: 
         col1 col2
1     finance   bh
3  accounting     
4          IT  

To get the rows in df1 not in df2 .

To get specifically

df3
    col1    col2
3   accounting  
4   IT  

you might try just selecting those with a non-matching col1.

df3 = df1[~df1.col1.isin(df2.col1)]
df3
Out[172]: 
         col1 col2
3  accounting     
4          IT

To independently check for a match in col1 & col2 and exclude a match on either while having NaNs compare unequal/never count as a match, you could use

df3 = df1[(~df1.col1.isin(common.col1)|df1.col1.isnull())&(~df1.col2.isin(common.col2)|df1.col2.isnull())]
df3
Out[439]: 
         col1 col2
3  accounting  NaN
4          IT  NaN

assuming you're working with actual NaNs, either None or np.nan, in your actual data, instead of empty strings as in this example. If the latter, you'll need to add

df1.replace('', np.nan, inplace=True)
df2.replace('', np.nan, inplace=True)

first.

Upvotes: 0

Related Questions