Reputation: 4100
Background
I have 2 data frames which has no common key to which I can merge them. Both df have a column that contains "entity name". One df contains 8000+ entities and the other close to 2000 entities.
Sample Data:
vendor_df=
Name of Vendor City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
regulator_df =
Name of Entity Committies
LACKEY SHEET METAL Private
PRIMUS STERILIZER COMPANY LLC Private
HELGET GAS PRODUCTS INC Autonomous
ORTHOQUEST LLC Governmant
Problem Stmt:
I have to fuzzy match the entities of these two(Name of vendor
& Name of Entity
) columns and get a score. So, need to know if 1st value of dataframe 1(vendor_df
) is matching with any of the 2000 entities of dataframe2(regulator_df).
StackOverflow Links I checked:
fuzzy match between 2 columns (Python)
create new column in dataframe using fuzzywuzzy
Apply fuzzy matching across a dataframe column and save results in a new column
Code
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)
regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)
compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
regulator_df['Name of Entity']]).to_series()
def metrics(tup):
return pd.Series([fuzz.ratio(*tup),
fuzz.token_sort_ratio(*tup)],
['ratio', 'token'])
#compare.apply(metrics) -- Either this works or the below line
result = compare.apply(metrics).unstack().idxmax().unstack(0)
Problems with Above Code:
The code works if the two dataframes are small but it is taking forever when I give the complete dataset. Above code is taken from 3rd link.
Any solution if the same thing can work fast or can work with large dataset?
UPDATE 1
Can the above code be made faster if we pass or hard-code a score say 80 which will filter series/dataframe only with fuzzyscore > 80 ?
Upvotes: 5
Views: 9134
Reputation: 177
I've implemented the code in Python with parallel processing, which will be much faster than serial computation. Furthermore, where a fuzzy metric score exceeds a threshold, only those computations are performed in parallel. Please see the link below for the code:
Vesrion Compatibility:
pandas version :: 1.1.5 ,
numpy vesrion:: 1.19.5,
fuzzywuzzy version :: 1.1.0 ,
joblib version :: 0.18.0
Fuzzywuzzy metric explanation: link text
Upvotes: 2
Reputation: 1520
in my case also i need to look for only above 80. i modified your code as per my use case.hope it helps.
compare = compare.apply(metrics)
compare_80=compare[(compare['ratio'] >80) & (compare['token'] >80)]
Upvotes: 1
Reputation: 4100
Below solution is faster than what I posted but if someone has a more faster approach please tell:
matched_vendors = []
for row in vendor_df.index:
vendor_name = vendor_df.get_value(row,"Name of vendor")
for columns in regulator_df.index:
regulated_vendor_name=regulator_df.get_value(columns,"Name of Entity")
matched_token=fuzz.partial_ratio(vendor_name,regulated_vendor_name)
if matched_token> 80:
matched_vendors.append([vendor_name,regulated_vendor_name,matched_token])
Upvotes: 3