Reputation: 27
I have two pandas DataFrames that both contain company names. I want to left join df1 (~10k rows) with df2 (~1.6M rows) on company name using a fuzzy match. My current function takes too long to run, so I would like to know if there is a more efficient way to do this.
I have tried fuzzy_merge and rapidfuzz with process.extractOne(). Is there a way to modify my current function? Or, even better, another package to use (but not fuzzy-matcher, because my laptop simply cannot handle it)?
Some sample data from df1 and df2.
df1 (10k rows)
| df1_ID | Company Name |
| --- | --- |
| AB0091 | Apple |
| AC0092 | Microsoft |
df2 (1.6m rows)
| df2_ID | Entity Name |
| --- | --- |
| F001ABC | Appl |
| F001AKC | Appl Inc |
| F001AJQ | Apple Farm |
| E002ABG | The microst |
| E002ACG | Minecraft |
| E004ABG | Microsoft Inc |
Ideal output (after the fuzzy match, keeping only the highest-scoring match for each company):
| df1_ID | Company Name | df2_ID | Entity Name | Matching Score |
| --- | --- | --- | --- | --- |
| AB0091 | Apple | F001ABC | Appl | 91 |
| AC0092 | Microsoft | E004ABG | Microsoft Inc | 100 |
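For reproducibility, the sample tables above can be built as small DataFrames (values copied straight from the tables):

```python
import pandas as pd

# Sample data from the tables above
df1 = pd.DataFrame({
    "df1_ID": ["AB0091", "AC0092"],
    "Company Name": ["Apple", "Microsoft"],
})
df2 = pd.DataFrame({
    "df2_ID": ["F001ABC", "F001AKC", "F001AJQ",
               "E002ABG", "E002ACG", "E004ABG"],
    "Entity Name": ["Appl", "Appl Inc", "Apple Farm",
                    "The microst", "Minecraft", "Microsoft Inc"],
})
```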
The following code is the function I'm using right now:
```python
import pandas as pd
from rapidfuzz import process

def fuzzy_match(row, df2, threshold=70):
    # extractOne returns (match, score, index), or None if nothing clears the cutoff
    best_match = process.extractOne(row['Company Name'], df2['Entity Name'],
                                    score_cutoff=threshold)
    if best_match:
        matched_id = df2.loc[df2['Entity Name'] == best_match[0], 'df2_ID'].values[0]
        return pd.Series([best_match[0], matched_id, best_match[1]])
    return pd.Series([None, None, None])

new_df = df1.copy()
new_df[['Matched_Company', 'df2_ID', 'Name_Matching_Score']] = df1.apply(
    fuzzy_match, axis=1, df2=df2)
```
I asked about this before ("How to do fuzzy merge with 2 large pandas dataframes?"), and that's how I got the function above. But I would really appreciate any more efficient method. Thank you.
Upvotes: 0
Views: 48