How to do fuzzy merge with 2 large pandas dataframes?

Question

I have 2 pandas dataframes that both contain company names. I want to merge them on company name using a fuzzy match. The problem is that one dataframe contains 5m rows and the other about 10k rows, so comparing every name against every other name means roughly 5 × 10^10 comparisons, and my fuzzy match takes forever to run. Is there a more efficient way to do this?

This is the code I'm using right now:

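    # Assumption: the post does not say which library provides process.extract;
    # fuzzywuzzy is used here.
    from fuzzywuzzy import process


    def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
        """
        :param df_1: the left table to join
        :param df_2: the right table to join
        :param key1: key column of the left table
        :param key2: key column of the right table
        :param threshold: how close the matches should be to return a match, based on Levenshtein distance
        :param limit: the number of matches that will be returned, sorted high to low
        :return: dataframe with both keys and matches
        """
        # Candidate names from the right table, built once.
        s = df_2[key2].tolist()

        # For each left-table name, get its top `limit` candidates with scores.
        m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
        df_1['matches'] = m

        # Keep only candidates at or above the threshold, joined into one string.
        m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
        df_1['matches'] = m2

        # Note: df_1 is modified in place and also returned.
        return df_1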

Here is some sample data from df1 and df2.

df1

df1_ID	Company Name
AB0091	Apple
AC0092	Microsoft

df2

df2_ID	Company Name
F001ABC	Appl
E002ABG	The microst

As you can see, the company names may contain typos and other differences between the 2 dataframes, and there is no other column I can use for the merge, which is why I need a fuzzy match on company names. The end goal is to match these 2 large dataframes on company name efficiently.
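
To make the end goal concrete, here is a minimal sketch of the pairing I expect for the sample rows above. It uses only the standard library: `difflib.SequenceMatcher` is a stand-in scorer (it is not Levenshtein-based and is not the scorer used in my function), and the names `similarity` and `best_match` are hypothetical helpers introduced only for this illustration.

```python
from difflib import SequenceMatcher

# Sample rows copied from the tables above: (ID, company name).
df1_rows = [("AB0091", "Apple"), ("AC0092", "Microsoft")]
df2_rows = [("F001ABC", "Appl"), ("E002ABG", "The microst")]


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale.

    Uses difflib's ratio as a stand-in; this is NOT a Levenshtein score.
    """
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates):
    """Return (candidate_id, candidate_name, score) for the top-scoring candidate."""
    scored = [(cid, cname, similarity(name, cname)) for cid, cname in candidates]
    return max(scored, key=lambda t: t[2])


# Pair every df1 ID with the ID of its best-scoring df2 row.
expected_pairs = [(id1, best_match(name, df2_rows)[0]) for id1, name in df1_rows]
print(expected_pairs)
```

This pairs AB0091 with F001ABC and AC0092 with E002ABG, which is the result I want from the merge. It still compares every left name against every right name, so it only illustrates the desired output, not a fix for the speed problem.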

Thank you!

