user34624
user34624

Reputation: 1

Fuzzywuzzy match 2 columns... script keeps running

I'm trying to match 2 columns of ~50.000 instances with Fuzzywuzzy. Column A (companies) contains company names, with some typos. Column B (correct) contains the correct company names.

I'm trying to match the typo ones with correct ones. When running my script below, the kernel keeps executing for hours & doesn't provide a result.

Any ideas on how to improve?

Many thanks!

Update link to files: https://fromsmash.com/STLz.VEub2-ct

import pandas as pd
from fuzzywuzzy import process, fuzz 
import matplotlib.pyplot as plt 

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []

for i in companies.Customers: 
    ratio = process.extract(i, correct.Correct, limit=1)
    actual_comp.append(ratio[0][0])
    similarity.append(ratio[0][1])
    
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity) 

companies.head(10)

Upvotes: 0

Views: 776

Answers (1)

maxbachmann
maxbachmann

Reputation: 3265

There are a couple of things you can change to improve the performance:

  1. Use Rapidfuzz instead of Fuzzywuzzy, since it implements the same algorithms, but is quite a bit faster (I am the author)

  2. The process functions are preprocessing all strings you pass to them (lowercases them, removes non alpha numeric characters and trims whitespaces). Right now your preprocessing correct.Correct len(companies.Customers) times, which costs a lot of time and could be done once in front of the loop instead

  3. Your only using the best match, so it is better to use process.extractOne instead of process.extract. This is more readable and inside extractOne rapidfuzz is using the results of previous comparision to improve the performance

The following snippet implements these changes for your code. Keep in mind, that your still performing 50k^2 comparisions, so while this should be a lot faster than your current solution it will still take a while.

import pandas as pd
from rapidfuzz import process, fuzz, utils
import matplotlib.pyplot as plt 

correct = pd.read_excel("correct.xlsx")
companies = pd.read_excel("companies2.xlsx")

actual_comp = []
similarity = []

company_mapping = {company: utils.default_process(company) for company in correct.Correct}

for customer in companies.Customers:
    _, score, comp = process.extractOne(
        utils.default_process(customer),
        company_mapping,
        processor=None)
    actual_comp.append(comp)
    similarity.append(score)
    
companies['actual_company'] = pd.Series(actual_comp)
companies['similarity'] = pd.Series(similarity) 

companies.head(10)

Out of interest I performed a quick benchmark calculating the average runtime when using your datasets. On my machine each lookup requires around 1 second with this solution (so a total of around 4.7 hours), while your previous solution took around 55 seconds per lookup (so a total of around 10.8 days).

Upvotes: 1

Related Questions