Reputation: 709
I have say two strings;
str1="wild animals are trying to escape the deserted jungle to the sandy island"
str2="people are trying to escape from the smoky mountain to the sandy road"
In order to find the match between these two strings, kgrams of certain length(here 10) are produced, their hashes are produced and the hashes of these two strings are compared. Say for example if the matching kgrams from these two strings are;
['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']
Please suggest me an efficient way of finding the consecutive substing (kgram) match from these kgrams. In the above case the actual answer would be
"aretryingtoescape"
Thanking you in advance!!!
Upvotes: 0
Views: 1089
Reputation: 493
Code following Ignacio's idea:
#!/usr/bin/env python
from itertools import groupby
str1 = 'wild animals are trying to escape the deserted jungle to the sandy island'
str2 = 'people are trying to escape from the smoky mountain to the sandy road'
words = ['aretryingt', 'etryingtoe', 'ngtoescape', 'tothesandy']
def solve(strings, words):
s = min([ s.replace(' ', '') for s in strings ], key=len)
coverage = [False]*len(s)
for w in words:
p = s.find(w)
if p >= 0:
for i in range(len(w)):
coverage[p+i] = True
return max([ ''.join([ y[1] for y in g ]) for k, g in groupby(enumerate(s), key=lambda x: coverage[x[0]]) if k ], key=len)
print solve([str1, str2], words)
Upvotes: 0
Reputation: 799450
First make yourself a coverage mask consisting of 0
and 1
(or other characters if you prefer), then find the longest run of 1
s with itertools.groupby()
.
Upvotes: 2