Reputation: 980
I'm implementing an KMP bytes pattern search with wildcard supported. Below is the algorithm for generating prefix table WITHOUT wildcard:
vector<int> PrefixFunction(string S)
{
vector<int> p(S.size());
int j = 0;
for (int i = 1; i < (int)S.size(); i++)
{
while (j > 0 && S[j] != S[i])
j = p[j-1];
if (S[j] == S[i])
j++;
p[i] = j;
}
return p;
}
The problem with the algorithm above is that it does not work with wildcards. Long story short, consider the following math equations:
a = b
b = c
a = c
Those equations are 100% true when no wildcards are involved. However, they do not apply to wildcards. EX: a = 1
, b = ??
(?? is the wildcard) and c = 2
. a equals b
and b equals c
, but a does not equal to c
Because of this weird property of wildcards, the algorithm mentioned above will not work! Consequently, I have to implement a specific algorithm for wildcards. My current implementation looks like following:
vector<int> GeneratePrefixTable(vector<int> bytes, vector<unsigned char> flags)
{
vector<int> prefixTable(bytes.size());
prefixTable[0] = 0;
for (int j = 1, m = bytes.size(); j < m; j++)
{
int largest = 0;
for (int i = 1; i < j; i++)
{
bool match = true;
for (int k = 0; k < i; k++)
{
if (bytes[k] != bytes[j - i + k + 1] && !flags[k] && !flags[j - i + k + 1])
{
//if the bytes do not match and neither of them is a wildcard
match = false;
break;
}
}
if (!match)
{
continue;
}
largest = i;
}
prefixTable[j] = largest;
}
return prefixTable;
}
variable definitions:
vector<int> bytes
the pattern. aka needle.vector<unsigned char> flags
flag array to indicate if the byte at a certain location is a wildcardj
the index of the pattern we are looking at generating prefix #largestOne
length of largest prefix found so fari
length of prefix we are testingNote that I did not halt the search once found a not working prefix length. This is also due to weird property of wildcard.
For example, consider the pattern 01 02 ?? 02 00
:
1
suffix:0
, mismatch1 2
suffix:2 0
mismatch1 2 ??
suffix: ?? 2 0
a match!Because of this weird property, I have to test for every possible prefix length. This slows down my algorithm even more. What are some ways, algorithm wise and implementation wise, that I could speed up my prefix table generating algorithm?
Upvotes: 1
Views: 1674
Reputation: 180145
You're confusing two types. The "character" type has one element less than the "search pattern, character or wildcard" type.
This causes you to mistakenly assume that "a" matches "??" (wildcard). It doesn't. The pattern "a" matches the character "a", the pattern "??" matches any character. Clearly different patterns.
It helps to consider regex patterns. [ab]
is the same pattern as [ba]
; [0-9]
is the same pattern as [0123456789]
but a
definitely is NOT the same pattern as .
.
Patterns are transitively equal. Again, using regex style: [abc]
= [acb]
= [bca]
.
KMP requires you to be careful about the distinction. You need to determine what the next possible match position is, given the current match position and the matched prefix length. This does not involve pattern equality. It involves the intersection of patterns.
Small sidestep: a pattern defines a set of strings to be accepted. Two patterns can be said to have an intersection, which is the set of strings that both accept. This is hard to calculate for complex patterns.
Practically speaking, KMP compares the search prefix found with a shifted version of the word searched. The value stored is the smallest shift for which the two patterns have a non-empty intersection.
Upvotes: 0