Reputation: 3257
I have 2 lists. Actual and Predicted. I need to compare both lists and determine the number of fuzzy matches. The reason I say fuzzy matches is due to the fact that they will not be the exact same. I am using the SequenceMatcher from the difflib library.
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
I can assume that strings with a percentage match of above 80% are considered to be the same. Example Lists
actual=[ "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]
I need a way to pick out that Apple, Peace and Orange in the predicted list has been found in the actual list. So only 3 matches have been made and not 5 matches. How do I do this efficiently?
Upvotes: 0
Views: 4419
Reputation: 1
You can also try the following approach to achieve your requirement:
import itertools
fuzlist = [ "Appl", "Orange", "Ornge", "Peace"]
actlist = ["Red", "Apple", "Green", "Peace", "Orange"]
foundlist = []
for fuzname in fuzlist:
for name in actlist:
for actname in itertools.permutations(name):
if fuzname.lower() in ''.join(actname).lower():
foundlist.append(name)
break
print set(foundlist)
Upvotes: 0
Reputation: 317
{x[1] for x in itertools.product(actual, predicted) if similar(*x) > 0.80}
Upvotes: 1
Reputation: 10621
You can turn both of the lists to sets and apply intersection on them.
That will give you three items {'Peace', 'Apple', 'Orange'}
.
Than, you can calculate the ratio within the result set len to the actual list len.
actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]
res = set(actual).intersection(predicted)
print (res)
print ((len(res) / len(actual)) * 100)
Edit:
In order to use the ratio you will need to implement nested loop. As set is implemented as a hash table so search is O(1), I would prefer to use the actual as a set.
If the predicted is in the actual (Exact match) so just add it to your result set. (best case is that all like that and final complexity is O(n)).
If the predicted is not in actual, loop though the actual and find whether a ratio over 0.8 is exist. (worst case is that all are like that, complexity (On^2))
actual={"Appl", "Orange", "Ornge", "Peace"}
predicted=["Red", "Apple", "Green", "Peace", "Orange"]
result = {}
for pre in predicted:
if pre in actual:
result.add(pre)
else:
for act in actual:
if (similar(pre, act) > 0.8):
result.add(pre)
Upvotes: 1
Reputation: 2630
First take the intersection of the two sets:
actual, predicted = set(actual), set(predicted)
exact = actual.intersection(predicted)
If this comprises all your actual words then you're done. However,
if len(exact) < len(actual):
fuzzy = [word for word in actual-predicted for match in predicted if similar(word, match)>0.8]
Finally your resulting set is exact.union(set(fuzzy))
Upvotes: 0
Reputation: 624
Simple approach, but NOT effective, would be:
counter = 0
for item in b:
if SequenceMatcher(None, a, item).ratio() > 0:
counter += 1
This is what you want, the number of fuzzy matched elements, not only the same elements (as offered by most other answers).
Upvotes: 0
Reputation: 3
I this case you only have to check if i'th element of predicted list is present in actual list or not. if present, then add to new list.
In [2]: actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
...: predicted=["Red", "Apple", "Green", "Peace", "Orange"]
In [3]: [i for i in predicted if i in actual]
Out[3]: ['Apple', 'Peace', 'Orange']
Upvotes: 0
Reputation: 328
You can use the following set comprehension to get the desired output using your similar
method if fuzzy matching is indeed what you're looking for.
threshold = 0.8
result = {x for x in predicted for y in actual if similar(x, y) > threshold}
Upvotes: 3
Reputation: 3876
>>> actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
>>> predicted=["Red", "Apple", "Green", "Peace", "Orange"]
>>> set(actual) & set(predicted)
set(['Orange', 'Peace', 'Apple'])
Upvotes: 0