Sanjay Kamath

Reputation: 63

How to find a similar substring inside a large string with a similarity score in python?

What I'm looking for is not just a plain similarity score between two texts, but a similarity score for a substring inside a larger string. Say:

text1 = 'cat is sleeping on the mat'

text2 = 'The cat is sleeping on the red mat in the living room'

In the above example, all the words of text1 are present in text2, hence the similarity should be 100%.

If some words of text1 are missing from text2, the score should be lower.

I'm working with a large dataset of paragraphs of varying sizes, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.

I have only found string similarity measures such as cosine similarity, difflib similarity, etc., which compare two whole strings, but nothing that scores a substring inside another string.

Upvotes: 5

Views: 6980

Answers (4)

0x4f

Reputation: 1

I think this can be achieved with the Levenshtein distance combined with substring matching. What you can do is split a sentence into smaller words (using spaces as a separator), then run the Levenshtein distance algorithm to match individual words against your search string. Something like:

def similar_word(string, substring):
    threshold = 2  # maximum edit distance to still count as a match

    def levenshtein_distance(s1, s2):
        # Classic dynamic-programming edit distance.
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]

        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0:
                    dp[i][j] = j
                elif j == 0:
                    dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    # Slide a window the size of the substring across the larger string.
    for i in range(len(string) - len(substring) + 1):
        distance = levenshtein_distance(string[i:i + len(substring)], substring)
        if distance <= threshold:
            return True

    return False

https://gist.github.com/4f77616973/66a784c4c5921359299d603419a8f01b

Since you want the score, you can tweak the above code to return the distance instead of True/False.
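For instance, one way to do that tweak is to keep the best (lowest) distance over all windows and normalise it into a 0..1 score. This is only a sketch of that idea (the name best_match_score is mine, and it assumes the larger string is at least as long as the substring):

```python
def best_match_score(string, substring):
    """Return the best windowed Levenshtein match as a 0..1 score,
    where 1.0 means the substring occurs verbatim in the string."""
    def levenshtein_distance(s1, s2):
        # Classic dynamic-programming edit distance.
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0:
                    dp[i][j] = j
                elif j == 0:
                    dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    # Best distance over every window of len(substring) in string
    # (assumes len(string) >= len(substring)).
    best = min(
        levenshtein_distance(string[i:i + len(substring)], substring)
        for i in range(len(string) - len(substring) + 1)
    )
    return 1 - best / len(substring)

print(best_match_score("the cat is sleeping on the red mat", "cat is sleeping"))  # 1.0
```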

Hope that helps! :)

Upvotes: 0

DarkCygnus

Reputation: 7838

Based on your description, how about:

>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a: #for every word in your string
        if word in b: #if it is in your bigger string increase score
            score += 1
>>> score/len(a) #obtain percentage given total word number
1.0

In case it had a missing word for example:

>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
        if w in b:
            score +=1
>>> score/len(c)
0.875

Additionally, you can do as @roadrunner suggests and split b into a set to speed up performance, with b = set(b.split(" ")). This reduces each lookup to O(1) and makes the overall algorithm O(n).
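A minimal sketch of that set-based variant (the helper name word_overlap_score is mine, not part of the answer):

```python
b = "the cat is sleeping on the red mat in the living room"
b_words = set(b.split())  # set membership tests are O(1) on average

def word_overlap_score(text, vocab):
    """Fraction of words in text that appear in the vocab set."""
    words = text.split()
    return sum(w in vocab for w in words) / len(words)

print(word_overlap_score("cat is sleeping on the mat", b_words))          # 1.0
print(word_overlap_score("the cat is not sleeping on the mat", b_words))  # 0.875
```

One subtlety worth noting: the REPL snippet above tests `word in b` against the raw string b, which is a substring check (so "at" would count as present because it occurs inside "mat"). Testing against a set of split words, as here, matches whole words only.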

Edit: You say you already tried some metrics like cosine similarity. However, I suspect you may benefit from checking the Levenshtein distance similarity, which could be useful in this case in addition to the solutions provided.

Upvotes: 7

RoadRunner

Reputation: 26335

You could also use a collections.defaultdict to store the counts of words in word_a that exist in word_b, then sum() the counts divided by the length of word_a at the end:

from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))

Which outputs:

0.875

Note: Since we only care about checking whether words in word_a exist in word_b, converting word_b to a set() allows O(1) lookups instead of the O(n) scans a list would need. This makes the overall time complexity of the above code O(n).

Upvotes: 4

Carlos Fernández

Reputation: 99

Similar to DarkCygnus, but here the similarity is based on total character count rather than word count. On the other hand, this script only checks coincidence with complete words (text_2.split()):

from __future__ import division

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'

words_2 = text_2.split()  # split once, reuse for every lookup
no_match = 0
match = 0

for word in text_1.split():
    if word not in words_2:
        no_match += len(word)
    else:
        match += len(word)

similarity = match / (match + no_match)
print('{0:.0%}'.format(similarity))
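Combining this character-weighted idea with the set speed-up from the other answers gives a compact variant (a sketch, with names of my own choosing):

```python
text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'

words_2 = set(text_2.split())  # whole-word lookups in O(1)

# Weight each matched word by its length in characters.
matched = sum(len(w) for w in text_1.split() if w in words_2)
total = sum(len(w) for w in text_1.split())

print('{0:.0%}'.format(matched / total))  # 100%
```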

Upvotes: 2
