string comparison for multiple values python

Question

I have two sets of data. The first (A) is a list of equipment with sophisticated names. The second (B) is a list of broader equipment categories, into which I have to group the items of the first list using string comparisons. I'm aware this won't be perfect.

For each entity in List A, I'd like to compute the Levenshtein distance to each entity in List B. The record in List B with the highest similarity score (that is, the smallest distance) will be the group to which I'll assign that data point.

I'm very rusty in Python and am playing around with FuzzyWuzzy to get the distance between two string values. However, I can't quite figure out how to iterate through each list to produce what I need.

I presumed I'd just create a list for each data set and write a pretty basic loop for each - but like I said I'm a little rusty and not having any luck.

Any help would be greatly appreciated! If there is another package that will allow me to do this (not Fuzzy) - I'm glad to take suggestions.

Patrick Haugh · Accepted Answer

It looks like the process.extractOne function is what you're looking for. A simple use case is something like:

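from collections import defaultdict

from fuzzywuzzy import process

complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']

# Maps each generic name to the list of complicated names assigned to it
group = defaultdict(list)

for name in complicated_names:
    # extractOne returns a (best_match, score) tuple; [0] is the matched generic name
    best_match = process.extractOne(name, generic_names)[0]
    group[best_match].append(name)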

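defaultdict is a dictionary that creates a default value for any missing key the first time that key is accessed. With defaultdict(list), the default is an empty list.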
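
To make the defaultdict behaviour concrete, here is a small sketch that uses only the standard library. The names and values are purely illustrative ('sofa bed' is a made-up entry, not from the question's data).

```python
from collections import defaultdict

group = defaultdict(list)

# 'couch' is not in the dictionary yet; accessing it creates an empty list,
# so append works without checking whether the key exists first.
group['couch'].append('leather couch')
group['couch'].append('sofa bed')

print(dict(group))  # {'couch': ['leather couch', 'sofa bed']}
```

With a plain dict, the first append would raise a KeyError unless the key had been initialised to an empty list beforehand.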

We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
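
Since the question also asks about alternatives to FuzzyWuzzy, here is a rough sketch of the same grouping loop using only the standard library's difflib. Note the assumptions: best_match is a hypothetical helper written for this example, and SequenceMatcher.ratio() is a different similarity measure from Levenshtein distance, so the scores (and occasionally the chosen group) may differ from what FuzzyWuzzy gives you.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def best_match(name, choices):
    """Hypothetical helper: return (choice, score) for the most similar choice.

    The score is SequenceMatcher.ratio(), a float between 0 and 1, where a
    higher value means the two strings are more similar.
    """
    scored = [
        (choice, SequenceMatcher(None, name, choice).ratio())
        for choice in choices
    ]
    # Pick the pair with the highest similarity score
    return max(scored, key=lambda pair: pair[1])


complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']

group = defaultdict(list)

for name in complicated_names:
    # Compare this name against every generic name and keep the best one
    category, score = best_match(name, generic_names)
    group[category].append(name)

for category, members in group.items():
    print(category, members)
```

Keeping the score around (rather than discarding it) can be useful if you want to flag low-confidence assignments for manual review.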
