linear time algorithm for finding most frequent m-letter substring in a string

Suppose we have a n letter string and we are searching for most repeated m letter substring (1=<m =< n). I am just searching for an algorithm which solves this problem in linear time. And I have reached to suffix tree. But how can I solve it by suffix tree?

Thanks a lot.

Upvotes: 0

Answers (1)

Tan Nguyen

Reputation: 128

Idea

You can also solve it with hash function.

We can convert strings to base b numbers where b is a prime number.

For example, if the strings only consist of lowercase alphabet (26 characters a-z) then we can choose b equals 29.

Then we map string characters to corresponding numbers. For example:

a -> 1
b -> 2
c -> 3
...
z -> 26

String abc will equals 29^2*1 + 29^1*2 + 29^0*3 = 899 in base 29.

We should map a -> 1 but not a -> 0 since hash value of aaa and aa will be equal in base b, which shouldn't be.

Now instead of compare two strings, we can compares their hash value in base b. If their hash value are equal then we can say they are equal.

Since hash value can be very large, you can use it's module a large prime number, for example mod 1e9+7. The posibility of two different strings have same hash value is very low in this case.

Algorithm

The algorithm can be described as bellow:

Let n-letter string be S
Let hash(s) be function to get hash value of string s
For each m-letter-substring of S, call it s
   Increase the number of occurrences of hash(s), let call its o(hash(s))
Result will be the m-letter-substring s with the maximum o(hash(s))

To calculate hash(s), first we build array H where:

H[i] = (b^(i-1)*S[1] + b^(i-2)*S[2] + b^(i-3)*S[3] + ... + b^0*S[i]) % mod

Here S[i] is the mapped number of character i-th of string S.

To calculate b^x, we can calculate array powb where:

powb[0] = 1; powb[i] = (powb[i - 1] * b) % mod

Then for a substring s[l..r] of string S,

hash(s[l..r]) = (H[r] - H[l-1]*b^(r-l+1)) % mod

As we can see, hash(s) can be negative, in this case we should add mod to hash(s) (hash(s) += mod).

Complexity

O(N) to calculate H, powb
O(N) to iterate every substring s
  For each s
    O(1) to calculate hash(s)
    O(log(N)) to calculate total occurrences of hash value (C++ map)

Total complexity: O(N log N)

Upvotes: 1

linear time algorithm for finding most frequent m-letter substring in a string

Answers (1)

Idea

Algorithm

Complexity

Related Questions