Reputation: 505
I'm trying to find the strings in two lists that almost match. Suppose there are two lists as below:
string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']
string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']
Expected output:
Similar = ['apple_from_2018','samsung_from_2017','htc_from_2015','lenovo_decommision_2017']
Not Similar =['nokia_from_2010','moto_from_2019']
I tried to do this with the implementation below, but it does not give the correct result:
similar = []
not_similar = []
for item1 in string_list_1:
    for item2 in string_list_2:
        if SequenceMatcher(a=item1,b=item2).ratio() > 0.90:
            similar.append(item1)
        else:
            not_similar.append(item1)
When I run this implementation, the result is not as expected. I would appreciate it if someone could identify what is missing and show how to get the required result.
Upvotes: 0
Views: 1232
Reputation: 1689
You can use the following function to measure the similarity between two given strings:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
print(similar("apple_from_2018", "apple_from_2020"))
Output :
0.8666666666666667
Using this function, you can select the strings whose similarity ratio exceeds a threshold. Note that you will need to lower your threshold from 0.90: the apple pair above scores only about 0.87, so a cutoff of 0.85 or lower is needed to get the expected output.
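To choose a threshold, it can help to look at the best score each string in the first list achieves against the second list. The following is only a minimal sketch; `best_score` is a hypothetical helper name, not something provided by difflib.

```python
from difflib import SequenceMatcher

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']

def best_score(item, candidates):
    """Return the highest SequenceMatcher ratio between item and any candidate."""
    return max(SequenceMatcher(None, item, other).ratio() for other in candidates)

# Map every string in the first list to its best score against the second list.
best_scores = {item: best_score(item, string_list_2) for item in string_list_1}

for item, score in best_scores.items():
    print(f"{item}: {score:.3f}")
```

Any cutoff that falls between the lowest score you want to accept and the highest score you want to reject will separate the two groups.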
The following code should work for you:
string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']
string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']
from difflib import SequenceMatcher
similar = []
not_similar = []
for item1 in string_list_1:
    # Track whether item1 has matched any string in string_list_2
    found = False
    for item2 in string_list_2:
        if SequenceMatcher(None, a=item1,b=item2).ratio() > 0.80:
            similar.append(item1)
            found = True
            # Stop at the first match so item1 is appended only once
            break
    if not found:
        not_similar.append(item1)
print("Similar : ", similar)
print("Not Similar : ", not_similar)
Output :
Similar : ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015', 'lenovo_decommision_2017']
Not Similar : ['nokia_from_2010', 'moto_from_2019']
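The break stops the inner loop as soon as a match is found, which saves time and prevents item1 from being appended more than once. In your original code, the else branch ran for every non-matching item2, so each item1 was added to not_similar several times, even when it also matched another string. I have also lowered the threshold to 0.80, since 0.90 was too high for the apple pair, but feel free to tweak the value.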
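As an alternative, the standard library's `difflib.get_close_matches` can replace the inner loop. This is a sketch, not a drop-in equivalent: `split_by_similarity` is a hypothetical helper name, and `get_close_matches` accepts scores greater than or equal to the cutoff, whereas the loop above uses a strict `>` comparison.

```python
from difflib import get_close_matches

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']

def split_by_similarity(candidates, references, cutoff=0.80):
    """Split candidates into those with and without a close match in references."""
    similar, not_similar = [], []
    for item in candidates:
        # n=1: a single close match is enough to classify the item as similar.
        if get_close_matches(item, references, n=1, cutoff=cutoff):
            similar.append(item)
        else:
            not_similar.append(item)
    return similar, not_similar

similar, not_similar = split_by_similarity(string_list_1, string_list_2)
print("Similar:", similar)
print("Not Similar:", not_similar)
```

Each string lands in exactly one of the two lists, so the duplicate appends from the original nested loop cannot occur.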
Upvotes: 4