consumer
consumer

Reputation: 709

Finding the consecutive substring match

I have say two strings;

str1="wild animals are trying to escape the deserted jungle to the sandy island"
str2="people are trying to escape from the smoky mountain to the sandy road"

In order to find the match between these two strings, kgrams of certain length(here 10) are produced, their hashes are produced and the hashes of these two strings are compared. Say for example if the matching kgrams from these two strings are;

['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy'] 

Please suggest me an efficient way of finding the consecutive substing (kgram) match from these kgrams. In the above case the actual answer would be

"aretryingtoescape"  

Thanking you in advance!!!

Upvotes: 0

Views: 1089

Answers (2)

msbrogli
msbrogli

Reputation: 493

Code following Ignacio's idea:

#!/usr/bin/env python

from itertools import groupby

str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
str2 = 'people are trying to escape from the smoky mountain to the sandy road'

words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']

def solve(strings, words):
    s = min([ s.replace(' ', '') for s in strings ], key=len)
    coverage = [False]*len(s)
    for w in words:
        p = s.find(w)
        if p >= 0:
            for i in range(len(w)):
                coverage[p+i] = True
    return max([ ''.join([ y[1] for y in g ]) for k, g in groupby(enumerate(s), key=lambda x: coverage[x[0]]) if k ], key=len)

print solve([str1, str2], words)

Upvotes: 0

Ignacio Vazquez-Abrams
Ignacio Vazquez-Abrams

Reputation: 799450

First make yourself a coverage mask consisting of 0 and 1 (or other characters if you prefer), then find the longest run of 1s with itertools.groupby().

Upvotes: 2

Related Questions