Reputation: 1
I'm trying to match two columns of ~50,000 rows each with Fuzzywuzzy. Column A (companies) contains company names, some with typos. Column B (correct) contains the correct company names.
I'm trying to match the misspelled names to the correct ones. When I run my script below, the kernel keeps executing for hours and never produces a result.
Any ideas on how to improve?
Many thanks!
Update link to files: https://fromsmash.com/STLz.VEub2-ct
import pandas as pd
from fuzzywuzzy import process, fuzz
import matplotlib.pyplot as plt

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []
for i in companies.Customers:
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])

companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
Upvotes: 0
Views: 776
Reputation: 3265
There are a couple of things you can change to improve the performance:

Use RapidFuzz instead of Fuzzywuzzy, since it implements the same algorithms but is quite a bit faster (I am the author).

The process functions preprocess all strings you pass to them (lowercasing, removing non-alphanumeric characters, and trimming whitespace). Right now you're preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time; it could be done once, before the loop, instead.

You're only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable, and inside extractOne RapidFuzz uses the results of previous comparisons to improve performance.

The following snippet implements these changes for your code. Keep in mind that you're still performing 50k × 50k comparisons, so while this should be a lot faster than your current solution, it will still take a while.
import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []
# preprocess the choices once, keeping the original names as dict keys
company_mapping = {company: utils.default_process(company)
                   for company in correct.Correct}
for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)

companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity)
companies.head(10)
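Why the dict works: when process.extractOne is given a mapping, it scores against the values but reports the key, so you get the original (unprocessed) company name back while comparing only preprocessed strings. A rough standard-library-only sketch of that behavior (difflib stands in for the real scorer, and default_process here only approximates rapidfuzz.utils.default_process):

```python
import difflib

def default_process(s):
    # rough stand-in for rapidfuzz.utils.default_process:
    # lowercase, replace non-alphanumerics with spaces, trim whitespace
    return " ".join("".join(c if c.isalnum() else " " for c in s.lower()).split())

def extract_one(query, mapping):
    # mimic RapidFuzz's mapping handling: score the preprocessed
    # values, but return the original key alongside the match
    best_key, best_val, best_score = None, None, -1.0
    for key, val in mapping.items():
        score = difflib.SequenceMatcher(None, query, val).ratio() * 100
        if score > best_score:
            best_key, best_val, best_score = key, val, score
    return best_val, best_score, best_key

# hypothetical choice list for illustration
mapping = {c: default_process(c) for c in ["Acme Corp.", "Globex LLC"]}
match, score, company = extract_one(default_process("acme corp"), mapping)
# company is "Acme Corp." — the original key, not the processed value
```

This is why the snippet above can pass processor=None: both sides are already preprocessed, so no per-comparison processing is repeated inside the loop.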
Out of interest, I performed a quick benchmark calculating the average runtime using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).
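If you want to reproduce a rough estimate like this on your own machine, one simple approach is to time a single lookup and extrapolate to the full column. A hedged sketch (difflib stands in for the scorer, and the choice list is made up; with the real data you would pass your preprocessed correct.Correct values):

```python
import difflib
import time

def time_lookup(query, choices, repeats=5):
    # average wall-clock time of one brute-force fuzzy lookup
    start = time.perf_counter()
    for _ in range(repeats):
        max(choices, key=lambda c: difflib.SequenceMatcher(None, query, c).ratio())
    return (time.perf_counter() - start) / repeats

per_lookup = time_lookup("acme corp", ["Acme Corp.", "Globex LLC"] * 100)
total_hours = per_lookup * 50_000 / 3600  # extrapolate to ~50,000 lookups
```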
Upvotes: 1