Reputation: 3456
I'm trying to compare two lists of strings and produce similarity metrics between both lists. The lists are of unequal length: one has roughly 50,000 entries, the other about 3,000.
Here are two MWE data frames that are similar to my data:
import pandas as pd

forbes = pd.DataFrame(
    {
        "company_name": [
            "Deloitte",
            "PriceWaterhouseCoopers",
            "KPMG",
            "Ernst & Young",
            "intentionall typo company XYZ",
        ],
        "revenue": [100, 200, 300, 250, 400],
    }
)
sf = pd.DataFrame(
    {"salesforce_name": ["Deloite", "PriceWaterhouseCooper"], "CEO": ["John", "Jane"]}
)
Here's how I produce similarity metrics. I'm using two for loops to calculate the similarity between each possible combination of strings, which with my actual data means 50,000 * 3,000 = 150 million comparisons. That's a lot, and I feel like there's a smarter way of doing this, but I don't know what it is.
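Spelled out, just as a sanity check on the scale (not part of the matching code):

n_forbes, n_salesforce = 50_000, 3_000
print(n_forbes * n_salesforce)  # 150_000_000 pairs, each scored four ways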
Here's my implementation:
from fuzzywuzzy import fuzz

scores = pd.DataFrame()
for friend in forbes["company_name"]:
    for address in sf["salesforce_name"]:
        r = fuzz.ratio(friend, address)  # Levenshtein-based similarity (0-100)
        pr = fuzz.partial_ratio(friend, address)  # best-matching substring
        tsr = fuzz.token_sort_ratio(friend, address)  # sorts words before comparing
        tser = fuzz.token_set_ratio(friend, address)  # ignores duplicated words
        if r > 80 or pr > 80 or tsr > 80 or tser > 80:
            scores = pd.concat(
                [
                    scores,
                    pd.DataFrame.from_records(
                        [
                            {
                                "forbes_company": friend,
                                "salesforce_company": address,
                                "r": r,
                                "pr": pr,
                                "tsr": tsr,
                                "tser": tser,
                            }
                        ]
                    ),
                ]
            )
Upvotes: 2
Views: 1751
Reputation: 3265
In addition to the improvement proposed by @Laurent, you can replace your usage of fuzzywuzzy
with rapidfuzz
. This is as simple as replacing your import with:
from rapidfuzz import fuzz
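As a minimal sketch of the swap (my example, not from the original post): the scorers keep the same names and the same 0-100 scale, so the loops from the question run unchanged. One caveat: rapidfuzz returns floats rather than rounded ints, and recent rapidfuzz versions don't lowercase/strip strings in the token scorers by default the way fuzzywuzzy does (pass processor=rapidfuzz.utils.default_process to mimic that), so individual scores can differ slightly.

from rapidfuzz import fuzz  # drop-in replacement for fuzzywuzzy's fuzz

print(fuzz.ratio("Deloitte", "Deloite"))  # ~93.3, same 0-100 scale as before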
On my machine this achieves the following runtimes:

| implementation | time |
| --- | --- |
| original implementation | 6m 16s |
| Laurent's implementation | 55s |
| original implementation + rapidfuzz | 2m 48s |
| Laurent's implementation + rapidfuzz | 8s |
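If you want to go further than the import swap, recent rapidfuzz versions also ship a batch helper, process.cdist, that scores every pair in one vectorized call and can spread the work across CPU cores. A sketch (my own addition, not part of the timings above), reusing the dataframes from the question:

import numpy as np
from rapidfuzz import fuzz, process

names_a = forbes["company_name"].tolist()
names_b = sf["salesforce_name"].tolist()

# One scorer per call; entries below score_cutoff are zeroed out,
# and workers=-1 uses all available cores.
matrix = process.cdist(names_a, names_b, scorer=fuzz.ratio, score_cutoff=80, workers=-1)
for i, j in zip(*np.nonzero(matrix)):
    print(names_a[i], "|", names_b[j], "|", matrix[i, j])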
Upvotes: 2
Reputation: 13518
With the following toy dataframes (5,000 rows of random words each):
from random import choice

import pandas as pd
import requests

response = requests.get("https://www.mit.edu/~ecprice/wordlist.10000")
WORDS = [word.decode("utf-8") for word in response.content.splitlines()]

fbs = pd.DataFrame(
    {
        "forbes_name": [choice(WORDS) for _ in range(5_000)],
    }
)
sf = pd.DataFrame({"salesforce_name": [choice(WORDS) for _ in range(5_000)]})
Your code outputs the following dataframe in 157 seconds (average on 10 runs) on my machine:
  forbes_name salesforce_name  ratio  token_set_ratio  token_sort_ratio  \
0       salad           salad    100              100               100
1   scientist       scientist    100              100               100
2   scientist       scientist    100              100               100
3      person        personal     86               86                86
4     analyst         analyst    100              100               100
...
   partial_ratio
0            100
1            100
2            100
3            100
4            100
...
One way to speed things up is not to evaluate all the ratios at once, but successively, in a recursive fashion, on a dataset that shrinks after each pass: fuzz.ratio (the fastest scorer) is calculated on all possible combinations, then fuzz.token_set_ratio only on the combinations that scored below the threshold with fuzz.ratio, and so on.
Once the similar rows have been identified, you just have to fill in the missing metrics for those rows (instead of computing all four metrics for all possible combinations).
To do that, you can define the following helper functions:
from fuzzywuzzy import fuzz

def get_similar_values(s, other_s, t):
    funcs = [
        fuzz.partial_ratio,
        fuzz.token_sort_ratio,
        fuzz.token_set_ratio,
        fuzz.ratio,
    ]  # ordered from slowest (partial_ratio) to fastest (ratio, which will run first)

    def evaluate(funcs, series, other_series, threshold, scores=None):
        if not funcs:
            return scores
        scores = scores or []
        idx = []
        other_idx = []
        func = funcs.pop()  # fastest remaining scorer
        for i, val in enumerate(series):
            for j, other_val in enumerate(other_series):
                if (score := func(val, other_val)) > threshold:
                    scores.append(
                        {
                            series.name: val,
                            other_series.name: other_val,
                            func.__name__: score,
                        }
                    )
                    idx.append(i)
                    other_idx.append(j)
        # drop already-matched rows before trying the next (slower) scorer
        series = series.drop(idx).reset_index(drop=True)
        other_series = other_series.drop(other_idx).reset_index(drop=True)
        return evaluate(funcs, series, other_series, threshold, scores)

    return evaluate(funcs, s, other_s, t)
def add_missing_metrics(similar_values):
    for func in [
        fuzz.token_set_ratio,
        fuzz.token_sort_ratio,
        fuzz.partial_ratio,
        fuzz.ratio,
    ]:
        for one_pair in similar_values:
            if func.__name__ not in one_pair.keys():
                one_pair[func.__name__] = func(
                    one_pair["forbes_name"], one_pair["salesforce_name"]
                )
    return similar_values
Then:
similar_values = get_similar_values(fbs["forbes_name"], sf["salesforce_name"], 80)
similar_values = add_missing_metrics(similar_values)
df = pd.DataFrame.from_records(similar_values)
This produces the same dataframe as before, but in 58 seconds (average on 10 runs), which is roughly 2.7 times faster.
Upvotes: 2