user2205916

Reputation: 3456

Fuzzywuzzy to compare two lists of strings of unequal length and save multiple similarity metrics

I'm trying to compare two lists of strings and produce similarity metrics between both lists. The lists are of unequal length: one is roughly 50,000 strings, the other about 3,000.

Here are two MWE data frames that are similar to my data:

import pandas as pd

forbes = pd.DataFrame(
    {
        "company_name": [
            "Deloitte",
            "PriceWaterhouseCoopers",
            "KPMG",
            "Ernst & Young",
            "intentionall typo company XYZ",
        ],
        "revenue": [100, 200, 300, 250, 400],
    }
)

sf = pd.DataFrame(
    {"salesforce_name": ["Deloite", "PriceWaterhouseCooper"], "CEO": ["John", "Jane"]}
)

Here's how I produce the similarity metrics. I'm using two for loops to calculate the similarity between each possible combination of strings, which with my actual data means 50,000 * 3,000 = 150 million comparisons. That's a lot, and I feel like there's a smarter way of doing this, but I don't know what it is.

Here's my implementation:

from fuzzywuzzy import fuzz

scores = pd.DataFrame()
for friend in forbes["company_name"]:
    for address in sf["salesforce_name"]:
        r = fuzz.ratio(friend, address)  # Levenshtein-based similarity ratio
        pr = fuzz.partial_ratio(friend, address)  # best matching substring
        tsr = fuzz.token_sort_ratio(friend, address)  # sorts words before comparing
        tser = fuzz.token_set_ratio(friend, address)  # ignores duplicated words
        if r > 80 or pr > 80 or tsr > 80 or tser > 80:
            scores = pd.concat(
                [
                    scores,
                    pd.DataFrame.from_records(
                        [
                            {
                                "forbes_company": friend,
                                "salesforce_company": address,
                                "r": r,
                                "pr": pr,
                                "tsr": tsr,
                                "tser": tser,
                            }
                        ]
                    ),
                ]
            )

Upvotes: 2

Views: 1751

Answers (2)

maxbachmann

Reputation: 3265

In addition to the improvement proposed by @Laurent, you can replace your usage of fuzzywuzzy with rapidfuzz. This is as simple as replacing your import with:

from rapidfuzz import fuzz

On my machine this achieves the following runtime:

implementation                          time
--------------------------------------  ------
original implementation                 6m 16s
Laurent's implementation                55s
original implementation + rapidfuzz     2m 48s
Laurent's implementation + rapidfuzz    8s
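For illustration, a minimal sketch of the drop-in swap (the example pair below is made up); the four fuzz.* scorers keep the same names and 0-100 scoring as in fuzzywuzzy:

from rapidfuzz import fuzz

a, b = "PriceWaterhouseCoopers", "PriceWaterhouseCooper"  # made-up example pair
print(fuzz.ratio(a, b))             # plain similarity ratio
print(fuzz.partial_ratio(a, b))     # best matching substring
print(fuzz.token_sort_ratio(a, b))  # word order ignored
print(fuzz.token_set_ratio(a, b))   # duplicated words ignored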

Upvotes: 2

Laurent

Reputation: 13518

With the following toy dataframes (5,000 rows of random words each):

from random import choice

import pandas as pd
import requests


response = requests.get("https://www.mit.edu/~ecprice/wordlist.10000")
WORDS = [word.decode("utf-8") for word in response.content.splitlines()]

fbs = pd.DataFrame(
    {
        "forbes_name": [choice(WORDS) for _ in range(5_000)],
    }
)
sf = pd.DataFrame({"salesforce_name": [choice(WORDS) for _ in range(5_000)]})

Your code outputs the following dataframe in 157 seconds (average on 10 runs) on my machine:

  forbes_name salesforce_name  ratio  token_set_ratio  token_sort_ratio  \
0       salad           salad    100              100               100   
1   scientist       scientist    100              100               100   
2   scientist       scientist    100              100               100   
3      person        personal     86               86                86   
4     analyst         analyst    100              100               100 
...  

   partial_ratio  
0            100  
1            100  
2            100  
3            100  
4            100
...

One way to speed things up is not to evaluate all ratios at once, but to evaluate them successively, in a recursive fashion, on a dataset that shrinks after each pass: fuzz.ratio (the fastest one) is calculated on all possible combinations, then fuzz.token_set_ratio only on the combinations that scored below 80 with fuzz.ratio, and so on.

Once the similar rows have been identified, you just have to fill in the missing metrics for those rows (instead of computing every metric for all possible combinations).

To do that, you can define the following helper functions:

from fuzzywuzzy import fuzz


def get_similar_values(s, other_s, t):
    funcs = [
        fuzz.partial_ratio,
        fuzz.token_sort_ratio,
        fuzz.token_set_ratio,
        fuzz.ratio,
    ]  # ordered from slowest (partial_ratio) to fastest (ratio, which will run first)

    def evaluate(funcs, series, other_series, threshold, scores=None):
        if not funcs:
            return scores

        scores = scores or []
        idx = []
        other_idx = []
        func = funcs.pop()

        for i, val in enumerate(series):
            for j, other_val in enumerate(other_series):
                if (score := func(val, other_val)) > threshold:
                    scores.append(
                        {
                            series.name: val,
                            other_series.name: other_val,
                            func.__name__: score,
                        }
                    )
                    idx.append(i)
                    other_idx.append(j)
        series = series.drop(idx).reset_index(drop=True)
        other_series = other_series.drop(other_idx).reset_index(drop=True)
        return evaluate(funcs, series, other_series, threshold, scores)

    return evaluate(funcs, s, other_s, t)


def add_missing_metrics(similar_values):
    for func in [
        fuzz.token_set_ratio,
        fuzz.token_sort_ratio,
        fuzz.partial_ratio,
        fuzz.ratio,
    ]:
        for one_pair in similar_values:
            if func.__name__ not in one_pair.keys():
                one_pair[func.__name__] = func(
                    one_pair["forbes_name"], one_pair["salesforce_name"]
                )
    return similar_values

Then:

similar_values = get_similar_values(fbs["forbes_name"], sf["salesforce_name"], 80)
similar_values = add_missing_metrics(similar_values)
df = pd.DataFrame.from_records(similar_values)

This produces the same dataframe as before, but in 58 seconds (average on 10 runs), which is roughly 2.7 times faster.
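For reference, here is a minimal sketch of how such a timing could be reproduced (it assumes the fbs and sf dataframes and the two helper functions defined above, and measures a single run rather than a 10-run average):

import time

start = time.perf_counter()
similar_values = get_similar_values(fbs["forbes_name"], sf["salesforce_name"], 80)
similar_values = add_missing_metrics(similar_values)
df = pd.DataFrame.from_records(similar_values)
print(f"elapsed: {time.perf_counter() - start:.1f}s")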

Upvotes: 2
