Reputation: 11
I am trying to figure out whether my string is similar to any item in a list. My problem is that the loop only iterates up to the size of my list, not over the length of my single string. Any suggestions?
my_string = "aplpe"
my_list=["orange", "apple", "grape"]
correctamount=0
if(my_string in my_list):
    print("Passed")
else:
    if any(my_string in s for s in my_list):
        for i in range(len(my_string) + 1):
            if my_string[i] == my_list[i][i]:
                correctamount += 1
                print(correctamount)
            else:
                correctamount == 0
                print(correctamount)
    if((correctamount/len(my_list) + 1 ) > .75):
        print("Passed")
    else:
        print("Failure")
Upvotes: 1
Views: 2975
Reputation: 10960
There are many algorithms for measuring the similarity between strings, and the Python library textdistance implements a large number of them.
I am going to use Jaccard similarity here, but you should choose the algorithm that best fits your requirements.
import textdistance as td
similarity_perc = [td.jaccard.normalized_similarity(my_string, s) for s in my_list]
Similarity percentage for each string
[0.22, 1.0, 0.42]
Get the index of the most similar string
most_similar_index = similarity_perc.index(max(similarity_perc))
# Omitted not found check. Please do it yourself.
print(my_list[most_similar_index])
Output
apple
If you plan to use this on a large dataset, the textdistance README includes a benchmark comparing it with other libraries.
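To make it clearer what the score above measures, here is a minimal standard-library sketch of a character-level Jaccard similarity. It is an approximation of what the library call computes, not its actual implementation; the function name jaccard_similarity and the choice to count repeated characters (multisets) are my assumptions.

```python
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Multiset Jaccard similarity over characters: |A & B| / |A | B|."""
    counts_a, counts_b = Counter(a), Counter(b)
    # Counter & Counter keeps the minimum count per character,
    # Counter | Counter keeps the maximum count per character.
    intersection = sum((counts_a & counts_b).values())
    union = sum((counts_a | counts_b).values())
    # Two empty strings are treated as identical (an arbitrary convention).
    return intersection / union if union else 1.0


my_string = "aplpe"
my_list = ["orange", "apple", "grape"]

scores = [jaccard_similarity(my_string, s) for s in my_list]
best_index = max(range(len(my_list)), key=scores.__getitem__)
print(my_list[best_index])
```

Note that this measure ignores character order, which is why "aplpe" and "apple" score 1.0: they contain exactly the same letters. If order matters for your use case, an edit-distance-based algorithm may be a better fit.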
Upvotes: 2
Reputation: 17408
There's a library called jellyfish for this purpose: https://github.com/jamesturk/jellyfish
>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> # Note: newer jellyfish releases name this function jaro_similarity
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
The library implements several string-matching algorithms:
Levenshtein Distance
Damerau-Levenshtein Distance
Jaro Distance
Jaro-Winkler Distance
Match Rating Approach Comparison
Hamming Distance
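If you want to see what the first algorithm in that list does, or cannot install a dependency, here is a rough pure-Python sketch of Levenshtein distance applied to the strings from the question. The helper names levenshtein and similarity, and the normalization by the longer string's length, are my own choices for illustration and are not part of jellyfish.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance into [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


my_string = "aplpe"
my_list = ["orange", "apple", "grape"]
best = max(my_list, key=lambda s: similarity(my_string, s))
print(best, similarity(my_string, best))
```

Plain Levenshtein counts the swapped "lp" in "aplpe" as two edits. Damerau-Levenshtein treats a swap of adjacent characters as a single edit, so it is usually the more forgiving choice for typos like this one.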
Upvotes: 3
Reputation: 5311
I believe in writing clean code and separating out individual pieces of functionality, so that the code is easy to read and to contribute to.
I defined a function is_similar to check the similarity_percentage.
Have a look at the following implementation:
def is_similar(my_string, test_string):
    # Compare the two strings position by position over the shorter length.
    min_len = min(len(my_string), len(test_string))
    count = 0
    for i in range(min_len):
        if my_string[i] == test_string[i]:
            count += 1
    # Fraction of my_string's characters that match at the same position.
    similarity_percentage = count / len(my_string)
    print("Similarity Percentage:", similarity_percentage)
    return similarity_percentage > 0.75

my_string = "aplpe"
my_list = ["orange", "apple", "grape"]
if my_string in my_list:
    print("Passed - Identical")
else:
    for candidate in my_list:
        if is_similar(my_string, candidate):
            print("Passed - Similar with", candidate)
        else:
            print("Failure")
Output:
Similarity Percentage: 0.0
Failure
Similarity Percentage: 0.6
Failure
Similarity Percentage: 0.4
Failure
Case 2:
If
my_string = "aplpe"
my_list=["orange", "apppe", "grape"]
Then, output:
Similarity Percentage: 0.0
Failure
Similarity Percentage: 0.8
Passed - Similar with apppe
Similarity Percentage: 0.4
Failure
Case 3:
If
my_string = "aplpe"
my_list=["orange", "aplpe", "grape"]
Then, output:
Passed - Identical
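One limitation of comparing position by position is that a single inserted or dropped character shifts everything after it and drags the score down. As a possible alternative (not what the function above does), the standard library's difflib can score candidates without that sensitivity. This is only a sketch; reusing 0.75 as the cutoff is my assumption, taken from the threshold in the question.

```python
from difflib import SequenceMatcher, get_close_matches

my_string = "aplpe"
my_list = ["orange", "apple", "grape"]

# ratio() returns a float in [0, 1]; higher means more similar.
ratios = {s: SequenceMatcher(None, my_string, s).ratio() for s in my_list}

# get_close_matches returns at most n candidates scoring at least cutoff,
# ordered from most to least similar.
matches = get_close_matches(my_string, my_list, n=1, cutoff=0.75)

if my_string in my_list:
    print("Passed - Identical")
elif matches:
    print("Passed - Similar with", matches[0])
else:
    print("Failure")
```

With this approach the transposed "apple" from the first case clears the 0.75 cutoff, whereas the position-by-position check scores it at 0.6.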
Upvotes: 0