Aaditya R Krishnan

Reputation: 505

How to check if strings in two lists are almost equal using Python

I'm trying to find the strings in two lists that almost match. Suppose there are two lists as below:

string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']

string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']

Expected output:
Similar = ['apple_from_2018','samsung_from_2017','htc_from_2015','lenovo_decommision_2017']
Not Similar = ['nokia_from_2010','moto_from_2019']

I tried the implementation below, but it does not give the expected result:

from difflib import SequenceMatcher

similar = []
not_similar = []
for item1 in string_list_1:
    for item2 in string_list_2:
        if SequenceMatcher(a=item1, b=item2).ratio() > 0.90:
            similar.append(item1)
        else:
            not_similar.append(item1)
  

The result is not what I expected. I would appreciate it if someone could identify what is missing and how to get the required result.

Upvotes: 0

Views: 1232

Answers (1)

Tanishq Vyas

Reputation: 1689

You can use the following function to measure the similarity between two strings:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


print(similar("apple_from_2018", "apple_from_2020"))

Output :

0.8666666666666667

Using this function, you can select the strings whose similarity ratio exceeds a threshold. This pair scores about 0.867, so a threshold of 0.90 rejects it; you need to lower the threshold to 0.85 or below to get the expected output.
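To pick a threshold, it can help to look at the best score each item in the first list gets against the second list. The snippet below is only a sketch of that idea; the helper name best_score is made up for illustration and is not part of difflib.

```python
from difflib import SequenceMatcher

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']


def best_score(item, candidates):
    """Hypothetical helper: highest ratio between item and any candidate."""
    return max(SequenceMatcher(None, item, c).ratio() for c in candidates)


# Map every item of the first list to its best similarity score.
best_scores = {item: best_score(item, string_list_2) for item in string_list_1}

for item, score in best_scores.items():
    print(f"{item}: {score:.3f}")
```

Any threshold that sits between the lowest score you want to accept and the highest score you want to reject will separate the two groups.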

The following code should produce the expected output:

string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']

string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']



from difflib import SequenceMatcher


similar = []
not_similar = []
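for item1 in string_list_1:

    # Track whether item1 has matched anything in string_list_2
    found = False
    for item2 in string_list_2:
        if SequenceMatcher(None, item1, item2).ratio() > 0.80:
            similar.append(item1)
            found = True
            # One match is enough, so stop comparing
            break

    # Only mark item1 as not similar after every comparison has failed
    if not found:
        not_similar.append(item1)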

print("Similar : ", similar)
print("Not Similar : ", not_similar)

Output :

Similar :  ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015', 'lenovo_decommision_2017']
Not Similar :  ['nokia_from_2010', 'moto_from_2019']

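The break stops the inner loop at the first match, and the found flag ensures each item from string_list_1 is appended exactly once, to one list or the other. The original loop appended item1 once per comparison, so every item was added several times and could land in both lists. I also lowered the threshold from 0.90 to 0.80, since 0.90 rejects 'apple_from_2018' (ratio about 0.867). Feel free to tweak the value.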
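As an alternative, difflib also provides get_close_matches, which runs the inner loop for you. The following is a minimal sketch, not a drop-in replacement: the function name split_by_similarity is invented here, and get_close_matches keeps candidates whose ratio is greater than or equal to the cutoff, whereas the loop above uses a strict greater-than comparison.

```python
from difflib import get_close_matches

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']


def split_by_similarity(items, candidates, cutoff=0.8):
    """Hypothetical helper: split items by whether any candidate is close."""
    similar, not_similar = [], []
    for item in items:
        # n=1: a single candidate at or above the cutoff is enough.
        if get_close_matches(item, candidates, n=1, cutoff=cutoff):
            similar.append(item)
        else:
            not_similar.append(item)
    return similar, not_similar


similar, not_similar = split_by_similarity(string_list_1, string_list_2)
print("Similar : ", similar)
print("Not Similar : ", not_similar)
```

With the two lists from the question and a cutoff of 0.8, this gives the same split as the loop above.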

Upvotes: 4
