Memo
Memo

Reputation: 5

string matching boyer moore..number of characters

in many examples of using Boyer moore algorithm, there is a declaration of 256 characters, I dont know what this number indicate for..please help

Example from (https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm):

function preprocess(pattern)
    T ← new table of 256 integers
    for i from 0 to 256 exclusive
        T[i] ← length(pattern)
    for i from 0 to length(pattern) - 1 exclusive
        T[pattern[i]] ← length(pattern) - 1 - i
    return T

Upvotes: 0

Views: 346

Answers (1)

mcskinner
mcskinner

Reputation: 2748

It is declaring that the alphabet has 256 characters in it.

This one byte limit works just fine for ASCII. But if you need Unicode, then you will also need more space in the table T. In fact, this space dependency is essential to the analysis of the algorithm.

As the Wikipedia article says:

The algorithm trades space for time in order to obtain an average-case complexity of O(n) on random text, although it has O(nm) in the worst case, where the length of the pattern is m and the length of the search string is n.

Boyer-Moore is O(n+m) on average, so this is faster theoretically. They're the same in the best case, and in pathological cases BMH can go off the rails more than BM will. But in practice the implementation of Boyer-Moore-Horspool is faster, because it uses space wisely. Which gets us back to that table T.

The fixed size table has gone out of fashion. You would probably use a dict or a HashMap or whatever your language of choice has on hand instead.

That brings the cost of the table down, a lot, for the case of capturing all Unicode characters. In fact it brings the space usage down from O(v) to O(min(v, n+m)).

Just be careful to use a hash-backed data structure so you don't accidentally add some log(v) factor (or worse) to your run time.

Upvotes: 0

Related Questions