Filter dataframe based on column of other df without isin()

Question

I want to filter df1 based on a column of df2. I only need to keep the rows of df1 whose values appear in df2. I tried using isin() like so:

import pandas as pd

df1 = pd.DataFrame({'A' : [5,6,3,6,3,4]})
df2 = pd.DataFrame({'B' : [0,0,3,6,0,0]})

df1[df1['A'].isin(df2['B'])]

This gives the desired df, keeping only the rows of df1 where A is 6 or 3.

However, my dataframes are very large (millions of rows) so this operation takes a significant amount of time. Are there other, more efficient ways to get the desired result?

griggy · Accepted Answer

What if you try a left join and then filter out the NAs? I just generated two somewhat large data frames (10 million rows in one and 4 million in the other), and on an average laptop with 8GB of RAM it ran in seconds. The example is below. Hope it helps.

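import pandas as pd

df1 = pd.DataFrame({'A' : range(10000000), "B": range(0, 20000000, 2)})
df2 = pd.DataFrame({'C' : range(4000000), "D": range(0, 8000000, 2)})
# Left join on df1.B == df2.C; rows of df1 without a match get NaN in C and D
df = pd.merge(df1, df2, how="left", left_on="B", right_on="C")
# Keep only the rows that found a match in df2
df = df[df["C"].notnull()].copy()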
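One thing to watch with the join approach: if the key column in df2 contains repeated values (as column B does in the question), a row of df1 can match more than once and show up several times in the result. Below is a minimal sketch of one way around that, assuming it is acceptable to de-duplicate the df2 keys first and use an inner join instead of left join plus notnull. The small frames and the names naive, keys, safe and expected are made up for illustration. Also note that merging on columns returns a fresh index, so the original df1 index labels are not kept.

```python
import pandas as pd

# Small illustrative frames; df2 repeats the key 6 on purpose
df1 = pd.DataFrame({"A": [5, 6, 3, 6, 3, 4]})
df2 = pd.DataFrame({"B": [0, 0, 3, 6, 6, 0]})

# Left join + notnull filter: each df1 row with A == 6 matches two df2 rows
naive = pd.merge(df1, df2, how="left", left_on="A", right_on="B")
naive = naive[naive["B"].notnull()]

# De-duplicate the lookup keys, then inner join so unmatched rows drop out
keys = df2[["B"]].drop_duplicates()
safe = pd.merge(df1, keys, how="inner", left_on="A", right_on="B")[["A"]]

# Reference result from the isin() approach in the question
expected = df1[df1["A"].isin(df2["B"])]

print(len(naive), len(safe), len(expected))
```

Whether this beats isin() on your data is something to measure yourself; timings depend on the dtypes and on how many distinct keys df2 has.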
