Reputation: 987
I want to filter df1 based on a column of df2: a row of df1 should be kept only if its value appears in df2. I tried using isin() like so:
import pandas as pd
df1 = pd.DataFrame({'A' : [5,6,3,6,3,4]})
df2 = pd.DataFrame({'B' : [0,0,3,6,0,0]})
df1[df1['A'].isin(df2['B'])]
This gives the desired DataFrame:
   A
1  6
2  3
3  6
4  3
However, my dataframes are very large (millions of rows), so this operation takes a significant amount of time. Are there other, more efficient ways to get the desired result?
Upvotes: 2
Views: 544
Reputation: 446
You could try a left join and then filter out the NAs. I generated two fairly large data frames (10 million rows and 4 million rows), and on an average laptop with 8GB of RAM it ran in seconds. The example is below. Hope it helps.
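import pandas as pd
# Two test frames: 10 million and 4 million rows
df1 = pd.DataFrame({'A' : range(10000000), "B": range(0, 20000000, 2)})
df2 = pd.DataFrame({'C' : range(4000000), "D": range(0, 8000000, 2)})
# Left join on B == C; unmatched df1 rows get NaN in C and D
df = pd.merge(df1, df2, how="left", left_on="B", right_on="C")
# Keep only the rows that matched
df = df[df["C"].notnull()].copy()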
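One caveat with the merge approach: a join emits one output row per matching pair, so if the key column in df2 contains a repeated value that also occurs in df1, the matching df1 rows come out duplicated. Below is a minimal sketch, assuming the small df1 and df2 from the question, that de-duplicates the keys first and uses an inner join so that no NA filter is needed. The names `keys` and `filtered` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 6, 3, 6, 3, 4]})
df2 = pd.DataFrame({'B': [0, 0, 3, 6, 0, 0]})

# De-duplicate the lookup keys so the join cannot multiply rows of df1.
keys = df2[['B']].drop_duplicates()

# An inner join keeps only the rows of df1 whose 'A' value appears in keys['B'];
# the helper column 'B' is dropped afterwards.
filtered = df1.merge(keys, how='inner', left_on='A', right_on='B')[['A']]
```

Note that the merged frame gets a fresh 0..n-1 index rather than the original df1 index. Whether this beats isin() depends on the data, so it is worth timing both on a sample of the real frames.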
Upvotes: 2