Reputation: 1087
I want to find all substrings in the string S that match a regular expression R. The regular expression can contain only '.' and symbols (where '.' means any symbol). I'm trying to use KMP to solve this:
1) Build the string T = R + '#' + S ('#' is deliminator here)
2) Calculate the prefix-func. for T
3) For pi (prefix-func. of T) check positions after '#' where pi[i] == len(S). In these positions sought-for substring ends.
But prefix-func. will not work properly on strings with '.' My code for prefix-func.:
pi[0] = 0;
for (int j = 0, i = 1; i < R.length(); i++) {
while (j > 0 && s[i] != s[j] && s[i] != '.' && s[j] != '.' || s[i] == '#' || s[j] == '#')
j = pi[j - 1];
if (s[i] == s[j] || (s[i] == '.' && s[i] != '#') || (s[j] == '.' && s[j] != '#'))
j++;
pi[i] = j;
}
It fails on test S="abab", T="a.".
I know that it's possible to use KMP to solve this problem, so can you tell me how?
Upvotes: 0
Views: 1063
Reputation: 4661
See http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml where a suffix tree-based algorithm is derived to find all occurences of a pattern P of length m with k wild cards in a string of length n in O(kn) time, which for k << m can be much better than naive O(nm) time achieved by checking all substrings of length m for a match.
Upvotes: 1
Reputation: 14205
I don't know a modification of KMP that handles don't-care symbols. You can build a deterministic string-matching automaton instead, or possibly use an Aho-Corasick variant on the consecutive non-don't-care symbols. I don't know how to prove a good worst-case bound on either of these.
Adam Kalai's one-pager from SODA 2002 discusses a very simple FFT-based approach to this problem. I might suggest using that if worst-case complexity is important.
Upvotes: 0