Shrey Tripathi
Shrey Tripathi

Reputation: 220

Longest substring repeating atleast k times

I am given a string of large length (say, 100,000) and an integer k, and I have to compute length of the largest sub-string which repeats in the given string atleast k times.
I found answers to this particular question here and here, but I wanted to know if there is any other efficient method to solve this question other than suffix trees?

Upvotes: 0

Views: 1353

Answers (1)

fas
fas

Reputation: 1413

There was large discussion in comments, I think it's better write an answer to sum up. TL;DR Longest substring repeating atleast k times

There is less efficient method, but it's really easier to understand than suffix trees: all you need to know is polynomial hashing and binary search.

1. String polynomial hash

Read about it here https://cp-algorithms.com/string/string-hashing.html. Below is short description of this technique.

Let's say we have string s, integers p and mod. Now we can define hash function:

hash(s) = (ord(s[0])*p^(len(s)-1) + ord(s[1])*p^(len(s)-2) + ... + ord(s[len(s)-1])*p^0) % mod 

where ord is a function returning an integer by character (let's say it's an ASCII-code of a character). Polynomial hash can be easily calculated for each prefix of string in O(len(s)):

# h[i] is a hash of prefix of length i.
# For example s = "abacaba",
# h[0] = hash("") = 0
# h[1] = hash("a")
# h[2] = hash("ab")
# ...
# h[7] = hash("abacaba")

h[0] = 0
for i in 1..n:
    h[i] = (h[i-1] * p + ord(s[i-1])) % mod

Also let's precalculate p^0 % mod, p^1 % mod, ..., p^len(s) % mod in array pow:

# pow[i] is (p^i) % mod
pow[0] = 1
for i in 1..n:
    pow[i] = (pow[i-1] * p) % mod

Using arrays h and pow we can easily calculate hash of any substring of the string s:

# get_substring_hash returns hash(s[l] + s[l+1] + ... + s[r-1]).
def get_substring_hash(s, l, r):
    value = h[r] - h[l]*pow[r-l]    # line a
    return (value%mod + mod) % mod  # line b

Let's understand why code above works.

h[r] = (ord(s[r-1])*p^0 + ord(s[r-2])*p^1 + ... + ord(s[l-1])*p^(r-l) + ord(s[l-2])*p^(r-l+1) + ...) % mod
h[l] = (                                          ord(s[l-1])*p^0     + ord(s[l-2])*p^1       + ...) % mod
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As you can see we need only ^^^-part from h[r] so we have to get rid of ~~~-part. ~~~-part in h[r] p^(r-l) times larger than in h[l] and this explains line a.

Line b is kinda magic when operating with % mod, value after line a can be negative, so value%mod + mod definitely makes is positive. At the same time if value was positive after line a value%mod + mod is larger than mod, so (value%mod + mod) % mod will definitely return value in range 0, 1, ..., mod-1.

Finally, mod is a large prime number like 10^9+7 and value is a small number, but larger than any possible ASCII-code like 239 (read in article why so).

Some notes:

  • Hashes may collide, because we have only mod possible values of hash but number of strings is infinite. How to deal with it read in article.
  • Doing operations like h[r] - h[l]*pow[r-l] may require using of 64-bit types of integers.

2. Binary search

Just read about it on Wikipedia, there's nothing specific https://en.wikipedia.org/wiki/Binary_search_algorithm.

3. Longest substring repeating at least k times solution

Let's say we precalculated arrays h and pow. Let's do binary search to find maximum length ans of string such that there are k or more such substrings in the given string s.

Why binary search works here? Because if there's some length x such as there are k or more equal substrings in s of length x, then there's definitely k or more equal substrings in s of length x-1 (just remove last letter from our matches).

How will binary search work? Let's say we're currently testing if there are k or more equal substrings of length mid. We're going to calculate all hashes of length mid (using get_substring_hash) and we'll make a decision of changing borders of binary search if there is a k equal hashes or not.

For example: s = "abcabcdefgdefgdefgdefg", k = 3. Answer is "defgdefg":

abcabcdefgdefgdefgdefg
      ^^^^^^^^          occurence 1
          ^^^^^^^^      occurence 2
              ^^^^^^^^  occurence 3

Binary search iterations:

l =  1, r = 22, mid = 11. No substring of length 11 satisfy.
l =  1, r = 10, mid =  5. There should be hash("defgd")    be seen 3 times.
l =  5, r = 10, mid =  7. There should be hash("defgdef")  be seen 3 times.
l =  7, r = 10, mid =  8. There should be hash("defgdefg") be seen 3 times.
l =  8, r = 10, mid =  9. No substring of length 9  satisfy.
l =  8, r =  8.           That means answer is 8.

As you can see, complexity is O(n log n): round(log n) binary search iterations and O(n) complexity per iteration if you use something like std::unordered_map to check if there's a hash with >= k occurrences or not.

I really hope everything is clear now.

Upvotes: 1

Related Questions