Kelvin  Zhang
Kelvin Zhang

Reputation: 980

Knuth–Morris–Pratt prefix table generation with wildcard

I'm implementing an KMP bytes pattern search with wildcard supported. Below is the algorithm for generating prefix table WITHOUT wildcard:

    vector<int> PrefixFunction(string S) 
    {
        vector<int> p(S.size());
        int j = 0;
        for (int i = 1; i < (int)S.size(); i++) 
        {
            while (j > 0 && S[j] != S[i])
                j = p[j-1];

           if (S[j] == S[i])
                j++;
           p[i] = j;
        }   
        return p;
   }

The problem with the algorithm above is that it does not work with wildcards. Long story short, consider the following math equations:

  1. a = b
  2. b = c
  3. a = c

Those equations are 100% true when no wildcards are involved. However, they do not apply to wildcards. EX: a = 1, b = ?? (?? is the wildcard) and c = 2. a equals b and b equals c, but a does not equal to c

Because of this weird property of wildcards, the algorithm mentioned above will not work! Consequently, I have to implement a specific algorithm for wildcards. My current implementation looks like following:

vector<int> GeneratePrefixTable(vector<int> bytes, vector<unsigned char> flags)
{
    vector<int> prefixTable(bytes.size());
    prefixTable[0] = 0;
    for (int j = 1, m = bytes.size(); j < m; j++)
    {
        int largest = 0;
        for (int i = 1; i < j; i++)
        {
            bool match = true;
            for (int k = 0; k < i; k++)
            {
                if (bytes[k] != bytes[j - i + k + 1] && !flags[k] && !flags[j - i + k + 1])
                {
                    //if the bytes do not match and neither of them is a wildcard
                    match = false;
                    break;
                }
            }
            if (!match)
            {
                continue;
            }
            largest = i;
        }

        prefixTable[j] = largest;
    }
    return prefixTable;
}

variable definitions:

  1. vector<int> bytes the pattern. aka needle.
  2. vector<unsigned char> flags flag array to indicate if the byte at a certain location is a wildcard
  3. j the index of the pattern we are looking at generating prefix #
  4. largestOne length of largest prefix found so far
  5. i length of prefix we are testing

Note that I did not halt the search once found a not working prefix length. This is also due to weird property of wildcard. For example, consider the pattern 01 02 ?? 02 00:

  1. with length of 1: prefix: 1 suffix:0, mismatch
  2. with length of 2: prefix: 1 2 suffix:2 0 mismatch
  3. with length of 3: prefix 1 2 ?? suffix: ?? 2 0 a match!

Because of this weird property, I have to test for every possible prefix length. This slows down my algorithm even more. What are some ways, algorithm wise and implementation wise, that I could speed up my prefix table generating algorithm?

Upvotes: 1

Views: 1674

Answers (1)

MSalters
MSalters

Reputation: 180145

You're confusing two types. The "character" type has one element less than the "search pattern, character or wildcard" type.

This causes you to mistakenly assume that "a" matches "??" (wildcard). It doesn't. The pattern "a" matches the character "a", the pattern "??" matches any character. Clearly different patterns.

It helps to consider regex patterns. [ab] is the same pattern as [ba]; [0-9] is the same pattern as [0123456789] but a definitely is NOT the same pattern as ..

Patterns are transitively equal. Again, using regex style: [abc] = [acb] = [bca].

KMP requires you to be careful about the distinction. You need to determine what the next possible match position is, given the current match position and the matched prefix length. This does not involve pattern equality. It involves the intersection of patterns.

Small sidestep: a pattern defines a set of strings to be accepted. Two patterns can be said to have an intersection, which is the set of strings that both accept. This is hard to calculate for complex patterns.

Practically speaking, KMP compares the search prefix found with a shifted version of the word searched. The value stored is the smallest shift for which the two patterns have a non-empty intersection.

Upvotes: 0

Related Questions