Michail N

Reputation: 3845

Calculate string matching percentage in a Python list

I have a list of strings and want to compute a score in the range [0, 1] that says how similar they are to each other. Here is my code:

from difflib import SequenceMatcher

listA = ['aaa','sss','ba']
listB = ['aa','aa']

def compare_strings(mylist):
    if len(mylist) < 2:
        return 0.00
    else:
        cnt = 0
        total = 0.0
        for i in range(len(mylist)):
            for j in range(i + 1, len(mylist)):
                val = SequenceMatcher(None, mylist[i], mylist[j]).ratio()
                total += val
                cnt += 1
        return total / cnt

print("String similarity in list 1 is %.5f" % compare_strings(listA))
print("String similarity in list 2 is %.5f" % compare_strings(listB))
>>>
String similarity in list 1 is 0.13333
String similarity in list 2 is 1.00000

This code works, but I don't like it because it seems a little complicated. Is there a better or more elegant way to solve this problem? Is there a way to express it with a lambda?

Upvotes: 2

Views: 3585

Answers (2)

ibarrond

Reputation: 7591

Here it is as a single-line lambda function. NumPy's mean is optional (you can implement your own mean).

from difflib import SequenceMatcher
import numpy as np
import itertools

listA = ['aaa','sss','ba']
listB = ['aa','aa']


similarity = lambda x: np.mean([SequenceMatcher(None, a,b).ratio() for a,b in itertools.combinations(x, 2)])

similarity(listA)
#> 0.13333333333333333
similarity(listB)
#> 1.0
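
If NumPy is not available, the same idea can probably be written with the standard library alone. This is only a sketch and not part of the original answer: the name similarity_stdlib is made up for illustration, and returning 0.0 when there are fewer than two strings is an assumption that mirrors the question's code.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import fmean  # available in Python 3.8+


def similarity_stdlib(strings):
    """Mean pairwise SequenceMatcher ratio; 0.0 if there are no pairs."""
    # Hypothetical helper name; the 0.0 fallback mirrors the question's code.
    ratios = [SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(strings, 2)]
    return fmean(ratios) if ratios else 0.0


print(similarity_stdlib(['aaa', 'sss', 'ba']))
print(similarity_stdlib(['aa', 'aa']))
```

One difference worth noting: np.mean returns nan (with a warning) for an empty list, whereas this version returns 0.0 for a list with fewer than two strings.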

Upvotes: 3

tobias_k

Reputation: 82899

You can use itertools.combinations to get all pairs of strings, sum their ratios with sum, and calculate the number of pairs directly as n * (n - 1) / 2 instead of counting them.

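from difflib import SequenceMatcher
from itertools import combinations

def compare_strings(mylist):
    if len(mylist) < 2: return 0.0  # no pairs to compare
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in combinations(mylist, 2))
    cnt = (len(mylist) * (len(mylist)-1)) // 2  # number of pairs: n * (n - 1) / 2
    return total / cnt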
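
The pair count used above is the binomial coefficient C(n, 2) = n * (n - 1) / 2. A quick sanity check, written as a sketch that assumes Python 3.8+ for math.comb, suggests it matches the number of pairs that combinations actually yields:

```python
from itertools import combinations
from math import comb

# Compare the closed-form pair count with the number of pairs generated.
for n in range(2, 7):
    items = list(range(n))
    formula = n * (n - 1) // 2
    generated = sum(1 for _ in combinations(items, 2))
    assert formula == generated == comb(n, 2)
```

The loop finishes silently when all three counts agree for each n.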

Upvotes: 1
