L H
L H

Reputation: 27

Fuzzy Match 2 Large Pandas Dataframes

I have 2 pandas dataframes that both contain company names. I want to left join df1(~10k rows) with df2(~1.6m rows) on company names using a fuzzy match. My current function takes too long to run, so I would like to know if there's any efficient way to do so?

I have tried fuzzy_merge, rapidfuzz + process.extractOne(). Is there a way to modify my current function? Or even better, another package to use(but not fuzzy-matcher because my laptop just couldn't allow it for no reason...)

Some sample data from df1 and df2.

df1 (10k rows)

df1_ID Company Name
AB0091 Apple
AC0092 Microsoft

df2 (1.6m rows)

df2_ID Entity Name
F001ABC Appl
F001AKC Appl Inc
F001AJQ Apple Farm
E002ABG The microst
E002ACG Minecraft
E004ABG Microsoft Inc

Ideal output (After fuzzy match and select the highest matched record)

df1_ID Company Name df2_ID Entity Name Matching Score
AB0091 Apple F001ABC Appl 91
AC0092 Microsoft E004ABG Microsoft Inc 100

The following codes are the function I'm using right now

def fuzzy_match(row, df2, key, threshold=70):
    best_match = process.extractOne(row['Company Name'], df2['Entity Name'], score_cutoff=threshold)
    if best_match:
        matched_id = df2.loc[df2['Entity Name'] == best_match[0], 'df2_ID'].values[0]
        return pd.Series([best_match[0], matched_id, best_match[1]])
    return pd.Series([None, None, None])

new_df[['Matched_Company', 'df2_ID', 'Name_Matching_Score']] = df1.apply(fuzzy_match, axis=1, df2=df2, key='Company Name')

I have asked the question before, and that's how I got the above function. But really appreciate if there's any more efficient method to solve my problem. Thank you. How to do fuzzy merge with 2 large pandas dataframes?

Upvotes: 0

Views: 48

Answers (0)

Related Questions