Reputation: 220
I am given a string of large length (say, 100,000) and an integer k, and I have to compute length of the largest sub-string which repeats in the given string atleast k times.
I found answers to this particular question here and here, but I wanted to know if there is any other efficient method to solve this question other than suffix trees?
Upvotes: 0
Views: 1353
Reputation: 1413
There was large discussion in comments, I think it's better write an answer to sum up. TL;DR Longest substring repeating atleast k times
There is less efficient method, but it's really easier to understand than suffix trees: all you need to know is polynomial hashing and binary search.
Read about it here https://cp-algorithms.com/string/string-hashing.html. Below is short description of this technique.
Let's say we have string s
, integers p
and mod
. Now we can define hash function:
hash(s) = (ord(s[0])*p^(len(s)-1) + ord(s[1])*p^(len(s)-2) + ... + ord(s[len(s)-1])*p^0) % mod
where ord
is a function returning an integer by character (let's say it's an ASCII-code of a character). Polynomial hash can be easily calculated for each prefix of string in O(len(s)):
# h[i] is a hash of prefix of length i.
# For example s = "abacaba",
# h[0] = hash("") = 0
# h[1] = hash("a")
# h[2] = hash("ab")
# ...
# h[7] = hash("abacaba")
h[0] = 0
for i in 1..n:
h[i] = (h[i-1] * p + ord(s[i-1])) % mod
Also let's precalculate p^0 % mod, p^1 % mod, ..., p^len(s) % mod in array pow
:
# pow[i] is (p^i) % mod
pow[0] = 1
for i in 1..n:
pow[i] = (pow[i-1] * p) % mod
Using arrays h
and pow
we can easily calculate hash of any substring of the string s
:
# get_substring_hash returns hash(s[l] + s[l+1] + ... + s[r-1]).
def get_substring_hash(s, l, r):
value = h[r] - h[l]*pow[r-l] # line a
return (value%mod + mod) % mod # line b
Let's understand why code above works.
h[r] = (ord(s[r-1])*p^0 + ord(s[r-2])*p^1 + ... + ord(s[l-1])*p^(r-l) + ord(s[l-2])*p^(r-l+1) + ...) % mod
h[l] = ( ord(s[l-1])*p^0 + ord(s[l-2])*p^1 + ...) % mod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see we need only ^^^
-part from h[r]
so we have to get rid of ~~~
-part. ~~~
-part in h[r]
p^(r-l)
times larger than in h[l]
and this explains line a.
Line b is kinda magic when operating with % mod
, value
after line a can be negative, so value%mod + mod
definitely makes is positive. At the same time if value
was positive after line a value%mod + mod
is larger than mod
, so (value%mod + mod) % mod
will definitely return value in range 0, 1, ..., mod-1.
Finally, mod
is a large prime number like 10^9+7 and value
is a small number, but larger than any possible ASCII-code like 239 (read in article why so).
Some notes:
mod
possible values of hash but number of strings is infinite. How to deal with it read in article.h[r] - h[l]*pow[r-l]
may require using of 64-bit types of integers.Just read about it on Wikipedia, there's nothing specific https://en.wikipedia.org/wiki/Binary_search_algorithm.
Let's say we precalculated arrays h
and pow
. Let's do binary search to find maximum length ans
of string such that there are k
or more such substrings in the given string s
.
Why binary search works here? Because if there's some length x
such as there are k
or more equal substrings in s
of length x
, then there's definitely k
or more equal substrings in s
of length x-1
(just remove last letter from our matches).
How will binary search work? Let's say we're currently testing if there are k
or more equal substrings of length mid
. We're going to calculate all hashes of length mid
(using get_substring_hash
) and we'll make a decision of changing borders of binary search if there is a k
equal hashes or not.
For example: s = "abcabcdefgdefgdefgdefg", k = 3. Answer is "defgdefg":
abcabcdefgdefgdefgdefg
^^^^^^^^ occurence 1
^^^^^^^^ occurence 2
^^^^^^^^ occurence 3
Binary search iterations:
l = 1, r = 22, mid = 11. No substring of length 11 satisfy.
l = 1, r = 10, mid = 5. There should be hash("defgd") be seen 3 times.
l = 5, r = 10, mid = 7. There should be hash("defgdef") be seen 3 times.
l = 7, r = 10, mid = 8. There should be hash("defgdefg") be seen 3 times.
l = 8, r = 10, mid = 9. No substring of length 9 satisfy.
l = 8, r = 8. That means answer is 8.
As you can see, complexity is O(n log n): round(log n) binary search iterations and O(n) complexity per iteration if you use something like std::unordered_map
to check if there's a hash with >= k occurrences or not.
I really hope everything is clear now.
Upvotes: 1