Reputation: 25
Suppose we have a n letter string and we are searching for most repeated m letter substring (1=<m =< n). I am just searching for an algorithm which solves this problem in linear time. And I have reached to suffix tree. But how can I solve it by suffix tree?
Thanks a lot.
Upvotes: 0
Views: 99
Reputation: 128
You can also solve it with hash function.
We can convert strings to base b
numbers where b
is a prime number.
For example, if the strings only consist of lowercase alphabet (26 characters a-z
) then we can choose b
equals 29
.
Then we map string characters to corresponding numbers. For example:
a -> 1
b -> 2
c -> 3
...
z -> 26
String abc
will equals 29^2*1 + 29^1*2 + 29^0*3 = 899
in base 29
.
We should map a -> 1
but not a -> 0
since hash value of aaa
and aa
will be equal in base b
, which shouldn't be.
Now instead of compare two strings, we can compares their hash value in base b
. If their hash value are equal then we can say they are equal.
Since hash value can be very large, you can use it's module a large prime number, for example mod 1e9+7
. The posibility of two different strings have same hash value is very low in this case.
The algorithm can be described as bellow:
Let n-letter string be S
Let hash(s) be function to get hash value of string s
For each m-letter-substring of S, call it s
Increase the number of occurrences of hash(s), let call its o(hash(s))
Result will be the m-letter-substring s with the maximum o(hash(s))
To calculate hash(s)
, first we build array H
where:
H[i] = (b^(i-1)*S[1] + b^(i-2)*S[2] + b^(i-3)*S[3] + ... + b^0*S[i]) % mod
Here S[i]
is the mapped number of character i-th of string S.
To calculate b^x
, we can calculate array powb
where:
powb[0] = 1; powb[i] = (powb[i - 1] * b) % mod
Then for a substring s[l..r]
of string S,
hash(s[l..r]) = (H[r] - H[l-1]*b^(r-l+1)) % mod
As we can see, hash(s)
can be negative, in this case we should add mod
to hash(s)
(hash(s) += mod
).
O(N) to calculate H, powb
O(N) to iterate every substring s
For each s
O(1) to calculate hash(s)
O(log(N)) to calculate total occurrences of hash value (C++ map)
Total complexity: O(N log N)
Upvotes: 1