Reputation: 27
I have two pandas dataframes that both contain company names, and I want to merge them on company name using a fuzzy match. The problem is that one dataframe has 5 million rows and the other about 10k rows, so my fuzzy match takes forever to run. Is there a more efficient way to do this?
This is the code I'm using right now:
from fuzzywuzzy import process  # assumed source of `process`; the import is not shown in the original post

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    # For each left-hand name, fetch the `limit` best-scoring candidates from the right table
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep only candidates at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1
Here is some sample data from df1 and df2:
df1
df1_ID | Company Name |
---|---|
AB0091 | Apple |
AC0092 | Microsoft |
df2
df2_ID | Company Name |
---|---|
F001ABC | Appl |
E002ABG | The microst |
As you can see, the company names may contain typos and other differences between the two dataframes, and there is no other column I can use for the merge, which is why I need a fuzzy match on company name. The end goal is to match these two large dataframes efficiently by company name.
Thank you!
Reputation: 11522
One possible approach is to use rapidfuzz and its process.extractOne() method, in the following manner:
import pandas as pd
from rapidfuzz import process

data1 = {
    'df1_ID': ['AB0091', 'AC0092'],
    'Company Name': ['Apple', 'Microsoft']
}
df1 = pd.DataFrame(data1)

data2 = {
    'df2_ID': ['F001ABC', 'E002ABG'],
    'Company Name': ['Appl', 'The microst']
}
df2 = pd.DataFrame(data2)

def fuzzy_match(row, df2, key, threshold=70):
    # Best-scoring name in df2 for this row, or None if nothing reaches the threshold
    best_match = process.extractOne(row[key], df2[key], score_cutoff=threshold)
    if best_match:
        # Look up the ID of the first df2 row carrying the matched name
        matched_id = df2.loc[df2[key] == best_match[0], 'df2_ID'].values[0]
        return pd.Series([best_match[0], matched_id, best_match[1]])
    return pd.Series([None, None, None])

df1[['Matched Company', 'Matched df2_ID', 'Match Score']] = df1.apply(fuzzy_match, axis=1, df2=df2, key='Company Name')
print(df1)
which will return:
   df1_ID Company Name Matched Company Matched df2_ID  Match Score
0  AB0091        Apple            Appl        F001ABC    88.888889
1  AC0092    Microsoft            None           None          NaN
The output depends entirely on the threshold you choose: 90 matches nothing, while 70 matches Apple. For a match on Microsoft, you would need to go lower (50),
   df1_ID Company Name Matched Company Matched df2_ID  Match Score
0  AB0091        Apple            Appl        F001ABC    88.888889
1  AC0092    Microsoft     The microst        E002ABG    60.000000
but at that threshold you may also get unreasonable matches on other names.