Reputation: 27
I have two pandas DataFrames that both contain company names. I want to left join df1 (~10k rows) with df2 (~1.6M rows) on company name using a fuzzy match. My current function takes too long to run, so I would like to know if there is a more efficient way to do this.
I have tried fuzzy_merge and rapidfuzz with process.extractOne(). Is there a way to modify my current function? Or, even better, another package to use (but not fuzzy-matcher, because my laptop simply cannot handle it)?
Some sample data from df1 and df2.
df1 (10k rows)
| df1_ID | Company Name |
| --- | --- |
| AB0091 | Apple |
| AC0092 | Microsoft |
df2 (1.6m rows)
| df2_ID | Entity Name |
| --- | --- |
| F001ABC | Appl |
| F001AKC | Appl Inc |
| F001AJQ | Apple Farm |
| E002ABG | The microst |
| E002ACG | Minecraft |
| E004ABG | Microsoft Inc |
Ideal output (after the fuzzy match, keeping only the highest-scoring match for each company):
| df1_ID | Company Name | df2_ID | Entity Name | Matching Score |
| --- | --- | --- | --- | --- |
| AB0091 | Apple | F001ABC | Appl | 91 |
| AC0092 | Microsoft | E004ABG | Microsoft Inc | 100 |
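For reproducibility, the sample tables above can be built as small DataFrames (values copied straight from the tables):

```python
import pandas as pd

# Sample data from the tables above
df1 = pd.DataFrame({
    "df1_ID": ["AB0091", "AC0092"],
    "Company Name": ["Apple", "Microsoft"],
})
df2 = pd.DataFrame({
    "df2_ID": ["F001ABC", "F001AKC", "F001AJQ",
               "E002ABG", "E002ACG", "E004ABG"],
    "Entity Name": ["Appl", "Appl Inc", "Apple Farm",
                    "The microst", "Minecraft", "Microsoft Inc"],
})
```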
The following code is the function I'm using right now:
```python
import pandas as pd
from rapidfuzz import process

def fuzzy_match(row, df2, threshold=70):
    # extractOne returns (match, score, index), or None if nothing clears the cutoff
    best_match = process.extractOne(row['Company Name'], df2['Entity Name'],
                                    score_cutoff=threshold)
    if best_match:
        matched_id = df2.loc[df2['Entity Name'] == best_match[0], 'df2_ID'].values[0]
        return pd.Series([best_match[0], matched_id, best_match[1]])
    return pd.Series([None, None, None])

new_df = df1.copy()
new_df[['Matched_Company', 'df2_ID', 'Name_Matching_Score']] = df1.apply(
    fuzzy_match, axis=1, df2=df2)
```
I asked about this before ("How to do fuzzy merge with 2 large pandas dataframes?"), and that's how I got the function above. But I would really appreciate any more efficient method. Thank you.
Upvotes: 0
Views: 48