Reputation: 1025
I was implementing the Boyer-Moore Algorithm for substring search in Python when I learned about the Galil Rule. I've looked around online for the Galil Rule but I haven't found anything more than a couple of sentences, and I cannot get access to the original paper. How can I implement this into my current algorithm?
i = 0
while i < (N - M + 1):
skip = 0
for j in reversed(range(0, M)):
if pattern[j] != text[i + j]:
skip = max(1, j - offsets[text[i+j]])
break
if skip == 0:
return i
i += skip
return -1
Notes:
Example: aaabcb
The few sentences I have found have said to keep track of when the first mismatch occurs in my inner loop (j, if the if-statement inside the inner loop is True) and the position in which I started the comparisons (i + j, in my case). I understand the intuition that I've already checked all the indices in between those, so I shouldn't have to do those comparisons again. I just don't understand how to connect the dots and arrive at an implementation.
Upvotes: 2
Views: 5377
Reputation: 111
The answer by @Lajos Nagy has explained the idea of Galil rule perfectly, however we have a more straightforward way to calculate k
:
Just use the prefix function of KMP algorithm.
The prefix[i]
means the longest proper prefix of P[0..i]
which is also a suffix.
And, k = m-prefix[m-1]
.
This article has explained the details.
Upvotes: 2
Reputation: 9465
The Galil rule is about exploiting periodicity in the pattern to reduce comparisons. Say you have a pattern abcabcab
. It's periodic with smallest period abc
. In general, a pattern P
is periodic if there's a string U
such that P
is a prefix of UUUUU...
. (In the above example, abcabcab
is clearly a prefix of the repeating string abc
= U
.) We call the shortest such string the period of P
. Let the length of that period be k
(in the example above k = 3
since U = abc
).
First of all, keep in mind that the Galil rule applies only after you've found an occurrence of P
in the text. When you do that, the Galil rule says that you could shift by k
(the periodicity of the pattern) and you only have to compare the last k
characters of the now shifted pattern to determine if there was a match.
Here's an example:
P = ababa
T = bababababab
U = ab
k = 2
First occurrence: b[ababa]babab
. Now you can shift by k = 2
and you only have to check the last two characters of the pattern:
T = bababa[ba]bab
P = aba[ba] // Only need to compare chars inside brackets for next match.
The rest of P
must match since P is periodic and you shifted it by its period k
from an existing match (this is crucial) so the repeating parts will nicely line up.
If you've found another match, just repeat. If you find a mismatch, however, you revert to the standard Boyer-Moore algorithm until you find another match. Remember, you can only use the Galil rule when you find a match and you shift by k
(otherwise the pattern is not guaranteed to line up with the previous occurrence).
Now, you might wonder, how to determine k
for a given pattern P
. You'll need to calculate the suffixes array N
first, where N[i]
will be the length of the longest common suffix of the prefix P[0, i]
and P
. (You can calculate the suffixes array by calculating the prefixes array Z
on the reverse of P
using the Z algorithm, as described here, for example.) Once you have the suffixes array, you can easily find k
since it'll be the smallest k > 0
such that N[m - k - 1] == m - k
(where m = |P|
).
For example:
P = ababa
m = 5
N = [1, 0, 3, 0, 5]
k = 2 because N[m - k - 1] == N[5 - 2 - 1] == N[2] == 3 == 5 - k
Upvotes: 7