Reputation: 1739
I have two data sets as following
A B
IDs IDs
1 1
2 2
3 5
4 7
How in Pandas, Numpy we can apply a join which can give me all the data from B, which is not present in A Something like Following
B
Ids
5
7
I know it can be done with for loop, but that I don't want, since my real data is in millions, and I am really not sure how to use Panda Numpy here, something like following
pd.merge(A, B, on='ids', how='right')
Thanks
Upvotes: 1
Views: 965
Reputation: 1770
You can simply use pandas' .isin()
method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7
Upvotes: 1
Reputation: 57421
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
Upvotes: 1
Reputation: 221524
You can use NumPy's setdiff1d
, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d
could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d
would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
Upvotes: 3
Reputation: 862581
You can use merge
with parameter indicator
and then boolean indexing
. Last you can drop
column _merge
:
A = pd.DataFrame({'IDs':[1,2,3,4],
'B':[4,5,6,7],
'C':[1,8,9,4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs':[1,2,5,7],
'A':[1,8,3,7],
'D':[1,8,9,4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True))
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
Upvotes: 2