user4881368
user4881368

Reputation:

Performing joins in Pandas

I am familiar with performing joind in pandas

pd.merge(A,B,on='key',how='inner')

But how do I write a query in pandas for these 3 given joins as they require a IF NULL enter image description here

Upvotes: 3

Views: 2137

Answers (2)

clemens
clemens

Reputation: 6813

Although the answer by piRSquared is a very nice one, here is another way of doing it:

import pandas as pd

Create DataFrames A and B:

A = pd.DataFrame({'key': range(1, 6), 'A': ['a'] * 5})
B = pd.DataFrame({'key': range(3, 8), 'B': ['b'] * 5})

The solution for example A (i.e. left excluding join): First perform a left join and then keep only columns of A that don't have a corresponding row in B:

pd.merge(A, B, on = 'key', how = 'left')[~A.key.isin(B.key)]

  key   A   B
0   1   a   NaN
1   2   a   NaN

The solution for example B (i.e. right excluding join): Very similar to solution A, but with a right join:

pd.merge(A, B, on = 'key', how = 'right')[~B.key.isin(A.key)]
    key A   B
3   6   NaN b
4   7   NaN b

The solution for example C (i.e. outer excluding join): First perform an full outer join:

outer = pd.merge(A, B, on = 'key', how = 'outer')

And then filter rows in A and B that do not have a corresponding key in B and A:

outer[outer.key.isin(list(A.key[~A.key.isin(B.key)]) + list(B.key[~B.key.isin(A.key)]))]

   key  A   B
0   1   a   NaN
1   2   a   NaN
5   6   NaN b
6   7   NaN b

Upvotes: 1

piRSquared
piRSquared

Reputation: 294218

Setup

A = pd.DataFrame(dict(key=range(0, 5), col1=list('abcde')))
B = pd.DataFrame(dict(key=range(2, 7), col2=list('vwxyz')))

print(A, B, sep='\n' * 2)

  col1  key
0    a    0
1    b    1
2    c    2
3    d    3
4    e    4

  col2  key
0    v    2
1    w    3
2    x    4
3    y    5
4    z    6

Using pd.DataFrame.merge

The most direct way to do this is to use the indicator parameter.

A.merge(B, 'outer', indicator=True)

  col1  key col2      _merge
0    a    0  NaN   left_only
1    b    1  NaN   left_only
2    c    2    v        both
3    d    3    w        both
4    e    4    x        both
5  NaN    5    y  right_only
6  NaN    6    z  right_only

Then we can use pd.DataFrame.query

A - B

A.merge(B, 'outer', indicator=True).query('_merge == "left_only"')

  col1  key col2     _merge
0    a    0  NaN  left_only
1    b    1  NaN  left_only

B - A

A.merge(B, 'outer', indicator=True).query('_merge == "right_only"')

  col1  key col2      _merge
5  NaN    5    y  right_only
6  NaN    6    z  right_only

Symmetric Difference

A.merge(B, 'outer', indicator=True).query('_merge != "both"')

  col1  key col2      _merge
0    a    0  NaN   left_only
1    b    1  NaN   left_only
5  NaN    5    y  right_only
6  NaN    6    z  right_only

Using pd.Series.isin (mostly)

However, I'd like to simply use pd.Series.isin as a boolean mask.

A - B

A[~A.key.isin(B.key)]

  col1  key
0    a    0
1    b    1

B - A

B[~B.key.isin(A.key)]

  col2  key
3    y    5
4    z    6

Symmetric Difference

A[~A.key.isin(B.key)].append(B[~B.key.isin(A.key)])

Or

A.append(B).drop_duplicates('key', keep=False)

  col1 col2  key
0    a  NaN    0
1    b  NaN    1
3  NaN    y    5
4  NaN    z    6

Upvotes: 5

Related Questions