miloserdow
miloserdow

Reputation: 1087

KMP modification - search for simple template match in string

I want to find all substrings in the string S that match a regular expression R. The regular expression can contain only '.' and symbols (where '.' means any symbol). I'm trying to use KMP to solve this:

1) Build the string T = R + '#' + S ('#' is deliminator here)

2) Calculate the prefix-func. for T

3) For pi (prefix-func. of T) check positions after '#' where pi[i] == len(S). In these positions sought-for substring ends.

But prefix-func. will not work properly on strings with '.' My code for prefix-func.:

pi[0] = 0;
for (int j = 0, i = 1; i < R.length(); i++) {
    while (j > 0 && s[i] != s[j] && s[i] != '.' && s[j] != '.' || s[i] == '#' || s[j] == '#')
        j = pi[j - 1];
        if (s[i] == s[j] || (s[i] == '.' && s[i] != '#') || (s[j] == '.' && s[j] != '#'))
            j++;
        pi[i] = j;
}

It fails on test S="abab", T="a.".

I know that it's possible to use KMP to solve this problem, so can you tell me how?

Upvotes: 0

Views: 1063

Answers (2)

user2566092
user2566092

Reputation: 4661

See http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml where a suffix tree-based algorithm is derived to find all occurences of a pattern P of length m with k wild cards in a string of length n in O(kn) time, which for k << m can be much better than naive O(nm) time achieved by checking all substrings of length m for a match.

Upvotes: 1

tmyklebu
tmyklebu

Reputation: 14205

I don't know a modification of KMP that handles don't-care symbols. You can build a deterministic string-matching automaton instead, or possibly use an Aho-Corasick variant on the consecutive non-don't-care symbols. I don't know how to prove a good worst-case bound on either of these.

Adam Kalai's one-pager from SODA 2002 discusses a very simple FFT-based approach to this problem. I might suggest using that if worst-case complexity is important.

Upvotes: 0

Related Questions