Reputation: 22438
Given a single row from dataframe X
, what is the most efficient way to retrieve all rows from dataframe Y
that completely match the query row?
Example: querying row [0,1,0,1]
from
[
[0,1,0,1, 1.0],
[0,1,0,1, 2.0],
[0,1,0,0, 3.0],
[1,1,0,0, 0.5],
]
should return
[
[0,1,0,1, 1.0],
[0,1,0,1, 2.0],
]
X
and Y
are assumed to have the same schema, except that Y
has an additional target value column. There may be one, zero, or many matches. The solution should be efficient even with thousands of columns.
Upvotes: 3
Views: 3534
Reputation: 164653
One efficient way is to drop down to numpy
and query individual columns:
Data from @jezrael.
import pandas as pd, numpy as np
df = pd.DataFrame({'A':list('abadef'),
'B':[4,5,4,5,5,4],
'C':[7,8,7,4,2,3],
'D':[1,3,1,7,1,0],
'E':[5,3,5,9,2,4],
'F':list('aaabbb')})
vals = df.values
arr = [4, 7, 1, 5]
mask = np.logical_and.reduce([vals[:, i+1]==arr[i] for i in range(len(arr))])
res = df.iloc[np.where(mask)[0]]
print(res)
# A B C D E F
# 0 a 4 7 1 5 a
# 2 a 4 7 1 5 a
Upvotes: 1
Reputation: 862591
Use boolean indexing
:
L = [
[0,1,0,1, 1.0],
[0,1,0,1, 2.0],
[0,1,0,0, 3.0],
[1,1,0,0, 0.5],
]
df = pd.DataFrame(L)
Y = [0,1,0,1]
print (df[df.iloc[:, :len(Y)].eq(Y).all(axis=1)])
0 1 2 3 4
0 0 1 0 1 1.0
1 0 1 0 1 2.0
Explanation:
First select first N
columns by length of sequence:
print (df.iloc[:, :len(Y)])
0 1 2 3
0 0 1 0 1
1 0 1 0 1
2 0 1 0 0
3 1 1 0 0
Compare all rows by first row selected by eq
and loc
:
print (df.iloc[:, :len(Y)].eq(Y))
0 1 2 3
0 True True True True
1 True True True True
2 True True True False
3 False True True False
And check if match by DataFrame.all
for check all True
s per row:
print (df.iloc[:, :len(Y)].eq(Y).all(1))
0 True
1 True
2 False
3 False
dtype: bool
Upvotes: 2