Knuth–Morris–Pratt prefix table generation with wildcard

Question

I'm implementing a KMP byte-pattern search that supports wildcards. Below is the algorithm for generating the prefix table WITHOUT wildcards:

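    #include <string>
    #include <vector>

    std::vector<int> PrefixFunction(const std::string& S)
    {
        std::vector<int> p(S.size());
        int j = 0; // length of the current longest border
        for (int i = 1; i < (int)S.size(); i++)
        {
            // fall back to shorter borders until S[i] extends one
            while (j > 0 && S[j] != S[i])
                j = p[j-1];

            if (S[j] == S[i])
                j++;
            p[i] = j;
        }
        return p;
    }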

The problem with the algorithm above is that it does not work with wildcards. In short, consider the following chain of reasoning:

∵a = b
∵b = c
∴a = c

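This reasoning always holds when no wildcards are involved, but it breaks down with wildcards. Example: a = 1, b = ?? (?? is the wildcard), and c = 2. a matches b and b matches c, but a does not match c. The fallback step j = p[j-1] in the algorithm above relies on exactly this kind of transitivity: it assumes that characters already known to match one another can be reused without comparing them again.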
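To make the non-transitivity concrete, here is a minimal sketch. The names `PatternByte`, `Matches`, and `BreaksTransitivity` are hypothetical helpers introduced only for this illustration; they are not part of my actual code.

```cpp
#include <cstdint>

// Hypothetical helper type: one pattern element, either a concrete byte
// or a wildcard (in which case `value` is ignored).
struct PatternByte {
    std::uint8_t value;
    bool wildcard;
};

// Two elements match if either one is a wildcard or their values are equal.
inline bool Matches(const PatternByte& x, const PatternByte& y)
{
    return x.wildcard || y.wildcard || x.value == y.value;
}

// True when a matches b and b matches c, yet a does not match c.
inline bool BreaksTransitivity(const PatternByte& a,
                               const PatternByte& b,
                               const PatternByte& c)
{
    return Matches(a, b) && Matches(b, c) && !Matches(a, c);
}
```

With a = 1, b = ?? and c = 2, `BreaksTransitivity(a, b, c)` is true, which is the case described above.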

Because of this property of wildcards, the algorithm above will not work. Consequently, I have to implement a separate algorithm for patterns with wildcards. My current implementation looks like the following:

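    #include <vector>

    // unsigned char is assumed as the byte type; flags[x] is true when bytes[x] is a wildcard
    std::vector<int> GeneratePrefixTable(const std::vector<unsigned char>& bytes, const std::vector<bool>& flags)
    {
        // elements are value-initialized to 0, so prefixTable[0] is already 0
        std::vector<int> prefixTable(bytes.size());
        for (int j = 1, m = (int)bytes.size(); j < m; j++)
        {
            int largest = 0;
            // proper prefixes of bytes[0..j] have lengths 1..j, so the bound is i <= j
            for (int i = 1; i <= j; i++)
            {
                bool match = true;
                for (int k = 0; k < i; k++)
                {
                    if (bytes[k] != bytes[j - i + k + 1] && !flags[k] && !flags[j - i + k + 1])
                    {
                        // the bytes do not match and neither of them is a wildcard
                        match = false;
                        break;
                    }
                }
                if (match)
                {
                    largest = i;
                }
            }

            prefixTable[j] = largest;
        }
        return prefixTable;
    }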

Variable definitions:

bytes: the pattern, a.k.a. the needle.
flags: flag array indicating whether the byte at a given position is a wildcard.
j: the index in the pattern for which we are generating the prefix value.
largest: length of the largest matching prefix found so far.
i: length of the prefix we are testing.

Note that I do not stop the search as soon as a prefix length fails to match. This is also due to the non-transitive behavior of wildcards. For example, consider the pattern 01 02 ?? 02 00:

With length 1: prefix 01, suffix 00: mismatch.
With length 2: prefix 01 02, suffix 02 00: mismatch.
With length 3: prefix 01 02 ??, suffix ?? 02 00: a match!
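The example above can be checked with a small brute-force sketch. `MatchingBorderLengths` is a hypothetical helper written only for this illustration, and `std::uint8_t` is an assumed byte type; it lists every length at which the prefix and the suffix of the whole pattern match under the wildcard rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns every length len in [1, n-1] for which the prefix of length len
// matches the suffix of length len, treating flagged positions as wildcards.
std::vector<int> MatchingBorderLengths(const std::vector<std::uint8_t>& bytes,
                                       const std::vector<bool>& flags)
{
    assert(bytes.size() == flags.size());
    const std::size_t n = bytes.size();
    std::vector<int> lengths;
    for (std::size_t len = 1; len < n; ++len)
    {
        bool match = true;
        for (std::size_t k = 0; k < len && match; ++k)
        {
            const std::size_t s = n - len + k; // position inside the suffix
            const bool wild = flags[k] || flags[s];
            if (!wild && bytes[k] != bytes[s])
                match = false;
        }
        if (match)
            lengths.push_back(static_cast<int>(len));
    }
    return lengths;
}
```

For 01 02 ?? 02 00 (with the wildcard position flagged) this returns only length 3: lengths 1 and 2 fail, and so does length 4. A match at a longer length therefore does not imply a match at any shorter one, which is why I cannot stop at the first failing length.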

Because of this property, I have to test every possible prefix length, which slows the algorithm down even more: with three nested loops it is roughly cubic in the pattern length in the worst case. What are some ways, both algorithmic and implementation-level, to speed up my prefix table generation?
