Fluxy
Fluxy

Reputation: 2978

How to find rows of one dataframe in another dataframe?

I have two pandas dataframes df1 and df2. I need to select indices of rows in df1, whose (specific) column values are included in df2.

This is my code:

selected_rows = []

for i, rowi in df1.iterrows():
  for j, rowj in df2.iterrows():
    if (rowi['COL1']==rowj[COL1']) & (rowi['COL2']==rowj['COL2']):
      selected_rows.append(i)

The problem is that df1 is large, and the nested for.loop takes a lot of time. Is there any alternative solution, e.g. using masks?

Sample data:

data1 = [['Alex',10,1],['Bob',12,1],['Clarke',13,4]]
df1 = pd.DataFrame(data1,columns=['COL1','COL2','COL3'])

data2 = [['Ted',10],['Bob',12],['Clarke',13]]
df2 = pd.DataFrame(data2,columns=['COL1','COL2'])

The expected answer is: rows 1 and 2 (staring the count from 0).

Upvotes: 0

Views: 3415

Answers (3)

Andrea
Andrea

Reputation: 3077

Here you have some possibilities with the timing.

Here's the input dataframes:

np.random.seed(42)
df1 = pd.DataFrame({'COL1':np.random.randint(0,30,100), 'COL2':np.random.randint(0,30,100)})
df2 = pd.DataFrame({'COL1':np.random.randint(0,30,100), 'COL2':np.random.randint(0,30,100)})

@Fluxy method:

%%timeit
selected_rows = []

for i, rowi in df1.iterrows():
    for j, rowj in df2.iterrows():
        if (rowi['COL1'] == rowj['COL1']) & (rowi['COL2'] == rowj['COL2']):
            selected_rows.append(i)

result:

693 ms ± 8.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@FBruzzesi method:

%%timeit
selected_rows = []

for j, rowj in df2.iterrows():
    selected_rows.append(df1[(df1['COL1'] == rowj['COL1']) & (df1['COL2'] == rowj['COL2'])].index.values)

# Convert to list of numbers
selected_rows = [np.asscalar(x) for x in selected_rows if len(x)]
selected_rows

result:

87.3 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

My suggestion (still not optimal):

%%timeit
selected_rows = [index for index, row in enumerate(df1[['COL1', 'COL2']].values.tolist())
                    if row in df2[['COL1', 'COL2']].values.tolist()]

result:

60.4 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The @luigigi method is the best by far:

%%timeit

mask = (df1[['COL1','COL2']].isin(df2[['COL1','COL2']])).all(axis=1)
selected_rows  = list(df1[mask].index)

result:

2.91 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 2

luigigi
luigigi

Reputation: 4215

Use this:

mask = (df1[['COL1','COL2']].isin(df2[['COL1','COL2']])).all(axis=1)
df1[mask]

     COL1  COL2  COL3
1     Bob    12     1
2  Clarke    13     4

selected_rows  = list(df1[mask].index)
[1, 2]

Upvotes: 4

FBruzzesi
FBruzzesi

Reputation: 6505

This solution will cycle only once and generate a list of integers. The only problem may occur if a row in df2 matches with more than one in df1

selected_rows = []

for j, rowj in df2.iterrows():
    selected_rows.append(df1[(df1['COL1']==rowj['COL1']) & (df1['COL2']==rowj['COL2'])].index.values)

# Convert to list of numbers
selected_rows = [np.asscalar(x) for x in selected_rows if len(x)]

Upvotes: 1

Related Questions