ixaixim
ixaixim

Reputation: 83

Pandas: find matching rows in two dataframes (without using `merge`)

Let's suppose I have these two dataframes with the same number of columns, but possibly different number of rows:

tmp = np.arange(0,12).reshape((4,3))
df = pd.DataFrame(data=tmp) 

tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)

print(df)
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

print(df2)
     a    b    c
0    3    4    5
1  100    4  100
2  101  100    3

I want to verify if the rows of df2 are matching any rows of df, that is I want to obtain a series (or an array) of boolean values that gives this result:

0     True
1    False
2    False
dtype: bool

I think something like the isin method should work, but I got this result, which results in a dataframe and is wrong:

print(df2.isin(df))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False

As a constraint, I wish to not use the merge method, since what I am doing is in fact a check on the data before applying merge itself. Thank you for your help!

Upvotes: 2

Views: 3288

Answers (5)

sophocles
sophocles

Reputation: 13831

You can use numpy.isin, which will compare all elements in your arrays and return True or False for each element for each array.

Then using all() on each array, will get your desired output as the function returns True if all elements are true:

>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])

0     True
1    False
2    False
dtype: bool

Breakdown of what is happening:

# np.isin
>>> np.isin(df2.values,df.values)

Out[139]: 
array([[ True,  True,  True],
       [False,  True, False],
       [False, False,  True]])

# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]

Out[140]: [True, False, False]

# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])

Out[141]: 
0     True
1    False
2    False
dtype: bool

Upvotes: 5

hilberts_drinking_problem
hilberts_drinking_problem

Reputation: 11602

There may be more efficient solutions, but you could append the two dataframes can call duplicated, e.g.:

df.append(df2).duplicated().iloc[df.shape[0]:]

This assumes that all rows in each DataFrame are distinct. Here are some benchmarks:

tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1,  columns=["a", "b", "c"]) 

tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)

df1 = pd.concat([df1] * 10_000).reset_index()
df2 = pd.concat([df2] * 10_000).reset_index()

%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28709

You can use a MultiIndex (expensive IMO):

pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])

Another option is to create a dictionary, and run isin:

df2.isin({key : array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]: 
0     True
1    False
2    False
dtype: bool

Upvotes: 1

Corralien
Corralien

Reputation: 120469

Use np.in1d:

>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0     True
1    False
2    False
dtype: bool

Another way, use frozenset:

>>> df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
0     True
1    False
2    False
dtype: bool

Upvotes: 1

Girolamo
Girolamo

Reputation: 311

Try:

df[~df.apply(tuple,1).isin(df2.apply(tuple,1))]

Here is my result:

Result

Upvotes: 0

Related Questions