Reputation: 2665
How can I do a left outer join, excluding the intersection, in Pandas?
I have 2 pandas dataframes
df1 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'accounting', 'IT'], 'col2' : ['az', 'bh', '', '', '']})
df2 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'finance', 'finance'], 'col2' : ['', 'az', '', '', '']})
df1
col1 col2
0 finance az
1 finance bh
2 finance
3 accounting
4 IT
df2
col1 col2
0 finance
1 finance az
2 finance
3 finance
4 finance
As you can see the dataframe has blank values as well. I tried using the example and its not giving me the result I want.
common = df1.merge(df2,on=['col1','col2'])
df3=df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
I want the output to look something like
col1 col2
3 accounting
4 IT
Upvotes: 8
Views: 16169
Reputation: 53
A one liner for this based on Bin's answer may be:
df=pd.merge(df1,df2[['col1']],on=['col1'],how="outer",indicator=True).query('_merge=="left_only"')
Upvotes: 4
Reputation: 3834
Pandas left outer join exclusion can be achieved by setting pandas merge's indicator=True. Then filter by the indicator in _merge column.
df=pd.merge(df1,df2[['col1']],on=['col1'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']
# this following line is just formating
df = df.reset_index()[['col1', 'col2']]
Output:
col1 col2
0 accounting
1 IT
====The following is an example showing the mechanism====
df1 = pd.DataFrame({'key1': ['0', '1'],
'key2': [-1, -1],
'A': ['A0', 'A1'],
})
df2 = pd.DataFrame({'key1': ['0', '1'],
'key2': [1, -1],
'B': ['B0', 'B1']
})
:
df1
Output:
A key1 key2
0 A0 0 -1
1 A1 1 -1
:
df2
Output:
B key1 key2
0 B0 0 1
1 B1 1 -1
:
df=pd.merge(df1,df2,on=['key1','key2'],how="outer",indicator=True)
:
Output:
A key1 key2 B _merge
0 A0 0 -1 NaN left_only
1 A1 1 -1 B1 both
2 NaN 0 1 B0 right_only
:With the above indicators in the _merge
column. you can select rows in one dataframe but not in another.
df=df[df['_merge']=='left_only']
df
Output:
A key1 key2 B _merge
0 A0 0 -1 NaN left_only
Upvotes: 14
Reputation: 2369
This fails because you're independently checking for a match in col1
& col2
, and excluding a match on either. The empty strings match the empty strings in the finance
rows.
You'd want:
df3 = df1[(~df1.col1.isin(common.col1))|(~df1.col2.isin(common.col2))]
df3
Out[150]:
col1 col2
1 finance bh
3 accounting
4 IT
To get the rows in df1
not in df2
.
To get specifically
df3
col1 col2
3 accounting
4 IT
you might try just selecting those with a non-matching col1
.
df3 = df1[~df1.col1.isin(df2.col1)]
df3
Out[172]:
col1 col2
3 accounting
4 IT
To independently check for a match in col1
& col2
and exclude a match on either while having NaN
s compare unequal/never count as a match, you could use
df3 = df1[(~df1.col1.isin(common.col1)|df1.col1.isnull())&(~df1.col2.isin(common.col2)|df1.col2.isnull())]
df3
Out[439]:
col1 col2
3 accounting NaN
4 IT NaN
assuming you're working with actual NaN
s, either None
or np.nan
, in your actual data, instead of empty strings as in this example. If the latter, you'll need to add
df1.replace('', np.nan, inplace=True)
df2.replace('', np.nan, inplace=True)
first.
Upvotes: 0