Reputation: 3257

Python - Matching strings from 2 lists

I have two lists, Actual and Predicted. I need to compare them and determine the number of fuzzy matches. I say fuzzy because the strings will not be exactly the same. I am using SequenceMatcher from the difflib library.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

I can assume that strings with a similarity ratio above 80% are considered to be the same. Example lists:

actual=[ "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

I need a way to pick out that Apple, Peace and Orange in the predicted list have been found in the actual list. So only 3 matches have been made, not 5. How do I do this efficiently?

Upvotes: 0

Answers (8)

ash

Reputation: 1

You can also try the following approach to achieve your requirement:

import itertools

fuzlist = [ "Appl", "Orange", "Ornge", "Peace"]
actlist = ["Red", "Apple", "Green", "Peace", "Orange"]
foundlist = []
for fuzname in fuzlist:
    for name in actlist:
        for actname in itertools.permutations(name):
            if fuzname.lower() in ''.join(actname).lower():
                foundlist.append(name)
                break

print(set(foundlist))

Upvotes: 0

TTT

Reputation: 317

{x[1] for x in itertools.product(actual, predicted) if similar(*x) > 0.80}

Upvotes: 1

omri_saadon

Reputation: 10621

You can turn both lists into sets and take their intersection.

That will give you three items {'Peace', 'Apple', 'Orange'}.

Then you can calculate the ratio of the length of the result set to the length of the actual list.

actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

res = set(actual).intersection(predicted)

print (res)
print ((len(res) / len(actual)) * 100)

Edit:

In order to use the ratio you will need to implement a nested loop. Since a set is implemented as a hash table, lookup is O(1), so I would prefer to store actual as a set.

If the predicted string is in actual (exact match), just add it to your result set. (The best case is that all of them are like that, and the final complexity is O(n).)

If the predicted string is not in actual, loop through actual and check whether any ratio over 0.8 exists. (The worst case is that all of them are like that, and the complexity is O(n^2).)

actual={"Appl", "Orange", "Ornge", "Peace"}
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

result = set()  # {} would create a dict, which has no add()

for pre in predicted:
    if pre in actual:
        result.add(pre)
    else:
        for act in actual:
            if (similar(pre, act) > 0.8):
                result.add(pre)
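
The steps above can be put together into a self-contained sketch. The function name fuzzy_matches and its threshold parameter are made up for illustration, and any() is used so the inner loop stops at the first sufficiently similar string. This is one possible arrangement under those assumptions, not a definitive implementation.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def fuzzy_matches(predicted, actual, threshold=0.8):
    """Return the predicted strings that match some actual string."""
    actual_set = set(actual)  # O(1) lookup for exact matches
    matched = []
    for pre in predicted:
        # Cheap exact check first; fall back to the fuzzy scan only if needed.
        if pre in actual_set or any(similar(pre, act) > threshold for act in actual_set):
            matched.append(pre)
    return matched

actual = ["Appl", "Orange", "Ornge", "Peace"]
predicted = ["Red", "Apple", "Green", "Peace", "Orange"]

matches = fuzzy_matches(predicted, actual)
print(matches)       # ['Apple', 'Peace', 'Orange']
print(len(matches))  # 3
```

Because the result is a list in predicted order, a string that appears twice in predicted is counted twice; wrap the result in set() if that is not wanted.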

Upvotes: 1

rodgdor

Reputation: 2630

First take the intersection of the two sets:

actual, predicted = set(actual), set(predicted)

exact = actual.intersection(predicted)

If this comprises all your actual words then you're done. However,

fuzzy = []  # stays empty when every actual word matched exactly
if len(exact) < len(actual):
    fuzzy = [word for word in actual-predicted for match in predicted if similar(word, match)>0.8]

Finally, your resulting set is exact.union(set(fuzzy)).
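
Note that the comprehension above collects words from actual. If the goal is to count predicted words, as the question asks, the roles can be swapped. Here is a rough sketch under that assumption; the function name matched_predictions and the threshold parameter are hypothetical.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def matched_predictions(actual, predicted, threshold=0.8):
    actual, predicted = set(actual), set(predicted)
    exact = actual & predicted
    # Only predicted words without an exact match need the slower fuzzy scan.
    fuzzy = {
        word
        for word in predicted - actual
        if any(similar(word, match) > threshold for match in actual)
    }
    return exact | fuzzy

result = matched_predictions(
    ["Appl", "Orange", "Ornge", "Peace"],
    ["Red", "Apple", "Green", "Peace", "Orange"],
)
print(sorted(result))  # ['Apple', 'Orange', 'Peace']
```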

Upvotes: 0

Filip Happy

Reputation: 624

A simple but NOT efficient approach would be:

counter = 0
for item in predicted:
    # count item once if it is close enough to any string in actual
    if any(SequenceMatcher(None, a, item).ratio() > 0.8 for a in actual):
        counter += 1

This is what you want, the number of fuzzy matched elements, not only the same elements (as offered by most other answers).

Upvotes: 0

Neeraj Meshram

Reputation: 3

In this case you only have to check whether the i-th element of the predicted list is present in the actual list. If it is present, add it to a new list.

In [2]: actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
...: predicted=["Red", "Apple", "Green", "Peace", "Orange"]


In [3]: [i for i in predicted if i in actual]
Out[3]: ['Apple', 'Peace', 'Orange']

Upvotes: 0

athul.sure

Reputation: 328

You can use the following set comprehension to get the desired output using your similar method if fuzzy matching is indeed what you're looking for.

threshold = 0.8
result = {x for x in predicted for y in actual if similar(x, y) > threshold}
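
If it also matters which actual string each predicted string matched, difflib.get_close_matches from the same library can be used instead. This swaps in a different helper than the similar function above, so treat it as a sketch: it scores candidates with the same SequenceMatcher ratio, but keeps those with a ratio greater than or equal to the cutoff, not strictly greater. The dictionary name best is made up for illustration.

```python
from difflib import get_close_matches

actual = ["Appl", "Orange", "Ornge", "Peace"]
predicted = ["Red", "Apple", "Green", "Peace", "Orange"]

best = {}
for pre in predicted:
    # n=1 keeps only the highest-scoring candidate; cutoff=0.8 mirrors the 80% rule.
    hits = get_close_matches(pre, actual, n=1, cutoff=0.8)
    if hits:
        best[pre] = hits[0]

print(best)       # {'Apple': 'Appl', 'Peace': 'Peace', 'Orange': 'Orange'}
print(len(best))  # 3
```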

Upvotes: 3

Saket Mittal

Reputation: 3876

>>> actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
>>> predicted=["Red", "Apple", "Green", "Peace", "Orange"]
>>> set(actual) & set(predicted)
set(['Orange', 'Peace', 'Apple'])

Upvotes: 0
