Reputation: 63
What I'm looking for is not just a plain similarity score between two texts, but a similarity score for a substring inside a string. Say:
text1 = 'cat is sleeping on the mat'.
text2 = 'The cat is sleeping on the red mat in the living room'.
In the above example, all the words of text1 are present in text2, so the similarity should be 100%. If some words of text1 are missing, the score should be lower.
I'm working with a large dataset of paragraphs of varying size, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.
I found only string similarities such as cosine similarity, difflib similarity, etc., which compare two whole strings, but nothing about scoring a substring inside another string.
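To make the expected behaviour concrete, here is a rough sketch of the kind of scoring I mean (`containment_score` is just an illustrative name, not something I have working):

```python
def containment_score(small, big):
    # Fraction of the smaller text's words that occur in the bigger text.
    small_words = small.lower().split()
    big_words = set(big.lower().split())
    return sum(w in big_words for w in small_words) / len(small_words)

text1 = 'cat is sleeping on the mat'
text2 = 'The cat is sleeping on the red mat in the living room'
print(containment_score(text1, text2))  # 1.0 -- every word is present
```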
Upvotes: 5
Views: 6980
Reputation: 1
I think this can be achieved with Levenshtein distance combined with substring matching. What you can do is split a sentence into smaller words (using spaces as a separator), then run the Levenshtein algorithm to match individual words against your search string. Something like:
def similar_word(string, substring):
    threshold = 2

    def levenshtein_distance(s1, s2):
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0:
                    dp[i][j] = j
                elif j == 0:
                    dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    # Slide a window the size of `substring` over `string` and test each one.
    for i in range(len(string) - len(substring) + 1):
        distance = levenshtein_distance(string[i:i + len(substring)], substring)
        if distance <= threshold:
            return True
    return False
https://gist.github.com/4f77616973/66a784c4c5921359299d603419a8f01b
Since you want the score, you can tweak the above code to return the distance instead of True/False.
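For example, that tweak might look like the following sketch (`best_match_distance` is my own name for it; it returns the smallest distance over all windows instead of a boolean):

```python
def best_match_distance(string, substring):
    # Same DP Levenshtein distance as above.
    def levenshtein_distance(s1, s2):
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0:
                    dp[i][j] = j
                elif j == 0:
                    dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    # Smallest distance over every window of len(substring); 0 means an exact hit.
    return min(
        levenshtein_distance(string[i:i + len(substring)], substring)
        for i in range(len(string) - len(substring) + 1)
    )

print(best_match_distance("hello world", "world"))  # 0
```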
Hope that helps! :)
Upvotes: 0
Reputation: 7838
Based on your description, how about:
>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a:  # for every word in your string
...     if word in b:  # if it is in your bigger string, increase score
...         score += 1
>>> score/len(a)  # obtain percentage given total word number
1.0
In case a word is missing, for example:
>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b:
...         score += 1
>>> score/len(c)
0.875
Additionally, you can do as @roadrunner suggests and split b, saving it as a set to speed up performance: b = set(b.split(" ")). This reduces each membership check to O(1) and improves the overall algorithm to O(n) complexity.
Edit: You say you already tried some metrics like cosine similarity, etc. However, I suspect you may benefit from checking Levenshtein distance similarity, which could be of use in this case in addition to the solutions provided.
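For instance, fuzzy word matching can be sketched with the standard library's difflib (the 0.8 threshold and the function name are my own choices, not a standard):

```python
from difflib import SequenceMatcher

def fuzzy_containment(small, big, threshold=0.8):
    # Count a word of `small` as present if some word of `big` is
    # similar enough (SequenceMatcher ratio >= threshold).
    big_words = big.lower().split()
    small_words = small.lower().split()

    def close(word):
        return any(SequenceMatcher(None, word, bw).ratio() >= threshold
                   for bw in big_words)

    return sum(close(w) for w in small_words) / len(small_words)

a = "cat is sleping on the mat"  # "sleeping" misspelled
b = "the cat is sleeping on the red mat in the living room"
print(fuzzy_containment(a, b))  # 1.0 despite the typo
```

This tolerates small typos that an exact `word in words` check would miss.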
Upvotes: 7
Reputation: 26335
You could also use a collections.defaultdict to store the counts of words in word_a that exist in word_b, then sum() the counts and divide by the length of word_a at the end:
from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))
Which outputs:
0.875
Note: Since we only care about checking whether words in word_a exist in word_b, converting word_b to a set() allows O(1) lookup instead of keeping it as a list, which would be O(n). This makes the overall time complexity of the above code O(n).
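The same counting idea can also be sketched with collections.Counter (an equivalent alternative, not part of the original answer):

```python
from collections import Counter

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

# Keep only the words of `a` that also occur in `b`.
counts = Counter(w for w in word_a if w in word_b)
print(sum(counts.values()) / len(word_a))  # 0.875
```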
Upvotes: 4
Reputation: 99
Similar to DarkCygbus's approach, but the similarity is based on the total character count instead of the word count. Note that this script only checks matches against complete words (text_2.split()):
from __future__ import division

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'

no_match = 0
match = 0
for word in text_1.split():
    if word not in text_2.split():
        no_match += len(word)
    else:
        match += len(word)

similarity = match / (match + no_match)
print('{0:.0%}'.format(similarity))
Upvotes: 2