Boyer-Moore Galil Rule

Question

I was implementing the Boyer-Moore Algorithm for substring search in Python when I learned about the Galil Rule. I've looked around online for the Galil Rule but I haven't found anything more than a couple of sentences, and I cannot get access to the original paper. How can I implement this into my current algorithm?

i = 0
while i < (N - M + 1):
    skip = 0
    for j in reversed(range(0, M)):
        if pattern[j] != text[i + j]:
            skip = max(1, j - offsets[text[i+j]])
            break
    if skip == 0:
        return i
    i += skip
return -1

Notes:

offsets[c] = -1 if c is not in the pattern
offsets[c] = last index of c in the pattern

Example: aaabcb

offsets[a] = 2
offsets[b] = 5
offsets[c] = 4
offsets[d] = -1

The few sentences I have found have said to keep track of when the first mismatch occurs in my inner loop (j, if the if-statement inside the inner loop is True) and the position in which I started the comparisons (i + j, in my case). I understand the intuition that I've already checked all the indices in between those, so I shouldn't have to do those comparisons again. I just don't understand how to connect the dots and arrive at an implementation.

Lajos Nagy · Accepted Answer

The Galil rule is about exploiting periodicity in the pattern to reduce comparisons. Say you have a pattern abcabcab. It's periodic with smallest period abc. In general, a pattern P is periodic if there's a string U such that P is a prefix of UUUUU.... (In the above example, abcabcab is clearly a prefix of the repeating string abc = U.) We call the shortest such string the period of P. Let the length of that period be k (in the example above k = 3 since U = abc).

First of all, keep in mind that the Galil rule applies only after you've found an occurrence of P in the text. When you do that, the Galil rule says that you could shift by k (the periodicity of the pattern) and you only have to compare the last k characters of the now shifted pattern to determine if there was a match.

Here's an example:

P = ababa
T = bababababab
U = ab
k = 2

First occurrence: b[ababa]babab. Now you can shift by k = 2 and you only have to check the last two characters of the pattern:

T = bababa[ba]bab
P =    aba[ba]       // Only need to compare chars inside brackets for next match.

The rest of P must match since P is periodic and you shifted it by its period k from an existing match (this is crucial) so the repeating parts will nicely line up.

If you've found another match, just repeat. If you find a mismatch, however, you revert to the standard Boyer-Moore algorithm until you find another match. Remember, you can only use the Galil rule when you find a match and you shift by k (otherwise the pattern is not guaranteed to line up with the previous occurrence).

Now, you might wonder, how to determine k for a given pattern P. You'll need to calculate the suffixes array N first, where N[i] will be the length of the longest common suffix of the prefix P[0, i] and P. (You can calculate the suffixes array by calculating the prefixes array Z on the reverse of P using the Z algorithm, as described here, for example.) Once you have the suffixes array, you can easily find k since it'll be the smallest k > 0 such that N[m - k - 1] == m - k (where m = |P|).

For example:

P = ababa
m = 5
N = [1, 0, 3, 0, 5]
k = 2  because  N[m - k - 1] == N[5 - 2 - 1] == N[2] == 3 == 5 - k

Boyer-Moore Galil Rule

Answers (2)

Related Questions