Reputation: 2978
I have two pandas dataframes df1
and df2
. I need to select indices of rows in df1
, whose (specific) column values are included in df2
.
This is my code:
selected_rows = []
for i, rowi in df1.iterrows():
for j, rowj in df2.iterrows():
if (rowi['COL1']==rowj[COL1']) & (rowi['COL2']==rowj['COL2']):
selected_rows.append(i)
The problem is that df1
is large, and the nested for.loop takes a lot of time. Is there any alternative solution, e.g. using masks?
Sample data:
data1 = [['Alex',10,1],['Bob',12,1],['Clarke',13,4]]
df1 = pd.DataFrame(data1,columns=['COL1','COL2','COL3'])
data2 = [['Ted',10],['Bob',12],['Clarke',13]]
df2 = pd.DataFrame(data2,columns=['COL1','COL2'])
The expected answer is: rows 1 and 2 (staring the count from 0).
Upvotes: 0
Views: 3415
Reputation: 3077
Here you have some possibilities with the timing.
Here's the input dataframes:
np.random.seed(42)
df1 = pd.DataFrame({'COL1':np.random.randint(0,30,100), 'COL2':np.random.randint(0,30,100)})
df2 = pd.DataFrame({'COL1':np.random.randint(0,30,100), 'COL2':np.random.randint(0,30,100)})
@Fluxy method:
%%timeit
selected_rows = []
for i, rowi in df1.iterrows():
for j, rowj in df2.iterrows():
if (rowi['COL1'] == rowj['COL1']) & (rowi['COL2'] == rowj['COL2']):
selected_rows.append(i)
result:
693 ms ± 8.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@FBruzzesi method:
%%timeit
selected_rows = []
for j, rowj in df2.iterrows():
selected_rows.append(df1[(df1['COL1'] == rowj['COL1']) & (df1['COL2'] == rowj['COL2'])].index.values)
# Convert to list of numbers
selected_rows = [np.asscalar(x) for x in selected_rows if len(x)]
selected_rows
result:
87.3 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
My suggestion (still not optimal):
%%timeit
selected_rows = [index for index, row in enumerate(df1[['COL1', 'COL2']].values.tolist())
if row in df2[['COL1', 'COL2']].values.tolist()]
result:
60.4 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The @luigigi method is the best by far:
%%timeit
mask = (df1[['COL1','COL2']].isin(df2[['COL1','COL2']])).all(axis=1)
selected_rows = list(df1[mask].index)
result:
2.91 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 2
Reputation: 4215
Use this:
mask = (df1[['COL1','COL2']].isin(df2[['COL1','COL2']])).all(axis=1)
df1[mask]
COL1 COL2 COL3
1 Bob 12 1
2 Clarke 13 4
selected_rows = list(df1[mask].index)
[1, 2]
Upvotes: 4
Reputation: 6505
This solution will cycle only once and generate a list of integers. The only problem may occur if a row in df2
matches with more than one in df1
selected_rows = []
for j, rowj in df2.iterrows():
selected_rows.append(df1[(df1['COL1']==rowj['COL1']) & (df1['COL2']==rowj['COL2'])].index.values)
# Convert to list of numbers
selected_rows = [np.asscalar(x) for x in selected_rows if len(x)]
Upvotes: 1