Reputation: 83
Let's suppose I have these two dataframes with the same number of columns, but possibly different number of rows:
tmp = np.arange(0,12).reshape((4,3))
df = pd.DataFrame(data=tmp)
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
print(df2)
a b c
0 3 4 5
1 100 4 100
2 101 100 3
I want to verify if the rows of df2
are matching any rows of df
, that is I want to obtain a series (or an array) of boolean values that gives this result:
0 True
1 False
2 False
dtype: bool
I think something like the isin
method should work, but I got this result, which results in a dataframe and is wrong:
print(df2.isin(df))
a b c
0 False False False
1 False False False
2 False False False
As a constraint, I wish to not use the merge
method, since what I am doing is in fact a check on the data before applying merge itself.
Thank you for your help!
Upvotes: 2
Views: 3288
Reputation: 13831
You can use numpy.isin
, which will compare all elements in your arrays and return True
or False
for each element for each array.
Then using all()
on each array, will get your desired output as the function returns True
if all elements are true:
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
0 True
1 False
2 False
dtype: bool
Breakdown of what is happening:
# np.isin
>>> np.isin(df2.values,df.values)
Out[139]:
array([[ True, True, True],
[False, True, False],
[False, False, True]])
# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]
Out[140]: [True, False, False]
# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])
Out[141]:
0 True
1 False
2 False
dtype: bool
Upvotes: 5
Reputation: 11602
There may be more efficient solutions, but you could append the two dataframes can call duplicated
, e.g.:
df.append(df2).duplicated().iloc[df.shape[0]:]
This assumes that all rows in each DataFrame are distinct. Here are some benchmarks:
tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1, columns=["a", "b", "c"])
tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)
df1 = pd.concat([df1] * 10_000).reset_index()
df2 = pd.concat([df2] * 10_000).reset_index()
%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop
Upvotes: 1
Reputation: 28709
You can use a MultiIndex (expensive IMO):
pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])
Another option is to create a dictionary, and run isin
:
df2.isin({key : array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]:
0 True
1 False
2 False
dtype: bool
Upvotes: 1
Reputation: 120469
Use np.in1d
:
>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0 True
1 False
2 False
dtype: bool
Another way, use frozenset
:
>>> df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
0 True
1 False
2 False
dtype: bool
Upvotes: 1
Reputation: 311
Try:
df[~df.apply(tuple,1).isin(df2.apply(tuple,1))]
Here is my result:
Upvotes: 0