Manu Sharma
Manu Sharma

Reputation: 1739

Outer join in python Pandas

I have two data sets as following

A         B
IDs      IDs
1        1
2        2
3        5
4        7

How in Pandas, Numpy we can apply a join which can give me all the data from B, which is not present in A Something like Following

B
Ids
5
7

I know it can be done with for loop, but that I don't want, since my real data is in millions, and I am really not sure how to use Panda Numpy here, something like following

pd.merge(A, B, on='ids', how='right')

Thanks

Upvotes: 1

Views: 965

Answers (4)

Alex Petralia
Alex Petralia

Reputation: 1770

You can simply use pandas' .isin() method:

df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]

If these are separate DataFrames:

a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]

Output:

   IDs
2    5
3    7

Upvotes: 1

Kurt Peek
Kurt Peek

Reputation: 57421

You could convert the data series to sets and take the difference:

import pandas as pd

df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])  
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)})   # Take difference and convert back to DataFrame 

The variable "C" then yields

   C
0  5
1  7

Upvotes: 1

Divakar
Divakar

Reputation: 221524

You can use NumPy's setdiff1d, like so -

np.setdiff1d(B['IDs'],A['IDs'])

Also, np.in1d could be used for the same effect, like so -

B[~np.in1d(B['IDs'],A['IDs'])]

Please note that np.setdiff1d would give us a sorted NumPy array as output.

Sample run -

>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
   IDs
1    7
2    5

Upvotes: 3

jezrael
jezrael

Reputation: 862581

You can use merge with parameter indicator and then boolean indexing. Last you can drop column _merge:

A = pd.DataFrame({'IDs':[1,2,3,4],
                   'B':[4,5,6,7],
                   'C':[1,8,9,4]})
print (A)
   B  C  IDs
0  4  1    1
1  5  8    2
2  6  9    3
3  7  4    4

B = pd.DataFrame({'IDs':[1,2,5,7],
                   'A':[1,8,3,7],
                   'D':[1,8,9,4]})

print (B)
   A  D  IDs
0  1  1    1
1  8  8    2
2  3  9    5
3  7  4    7

df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']

df = df.drop('_merge', axis=1)
print (df)
    B   C  IDs    A    D
4 NaN NaN  5.0  3.0  9.0
5 NaN NaN  7.0  7.0  4.0

Upvotes: 2

Related Questions