Kevin Shannon
Kevin Shannon

Reputation: 35

How to parse an input string for matches given a DFA

I'm implementing a regular expression parser from scratch by generating a NFA from a regular expression and then a DFA from the NFA. The problem is DFA only can say when a computation is accepting. If the regex is "n*" and the string to match is "cannot" the DFA will go to a failed state after it sees the c, so then I drop the first character from the front, "annot" then "nnot". At this point it sees the n and goes to a final state and would just return the single n so I told it to keep trying until the next character would take it off a final state. however when it finishes it removes the first character again so it would be "not" and it would match the "n" but I don't want the subsequent matches, I only want "nn". I don't know how this would be possible.

Upvotes: 1

Views: 1082

Answers (1)

rici
rici

Reputation: 241771

Here's a simple but possibly not optimal algorithm. We try an anchored match at each successive point in the string by running the DFA starting at that point. When the DFA is being run, we record the last point in the string where the DFA was in an accepting state. When we eventually either reach the end of the string or a point at which the DFA can no longer advance, we can return a match if we passed through an accepting state; in other words, if we saved an accepting position, which will be the end of the match. Otherwise, we backtrack to the next starting position and continue.

(Note: in both pseudocode algorithms below, it's assumed that a variable which holds a string index can have an Undefined value. In a practical implementation, this value could be, for example, -1.)

In pseudocode:

Set <start> to 0
Repeat A:
     Initialise the DFA state the starting state.
     Set <scan> to <start>
     Set <accepted> to Undefined
     Repeat B:
        If there is a character at <scan> and
        the DFA has a transition on that character:
            Advance the DFA to the indicated next state
            Increment <scan>
            If the DFA is now in an accepting state, set <accepted> to <scan>
            Continue Loop B
        Otherwise, the DFA cannot advance:
            If <accepted> is still Undefined:
                Increment <start> and continue Loop A
            Otherwise, <accepted> has a value:
                Return a match from <scan> to <accepted> (semi-inclusive)

The problem with the above algorithm is that Loop B could execute an arbitrary number of times before failing and backtracking to the next starting position. So in the worst case, the search time would be quadratic in the string length. That would happen, for example, with pattern a*b and a string consisting of a large number of as.

An alternative is to run several DFAs in parallel. Each DFA corresponds to a different starting point in the pattern. We scan the string linearly; at each position, we can spawn a new DFA corresponding to that position, whose state is the initial state.

It's important to note that not every starting point has a DFA, because it is not necessary to keep two DFAs with the same state. Since the search is for the first match in the string, if two DFAs share the same state, only the one with the earlier starting point is a plausible match. Furthermore, once some DFA reaches an accepting state, it is no longer necessary to retain any DFA whose starting point comes later, which means that as soon as an accepting state is reached by any DFA, we no longer add new DFAs in the scan.

Since the number of active DFAs is at most the number of states in the DFA, this algorithm runs in O(NM) where N is the length of the string and M is the number of states in the DFA. In practice, the number of active DFAs is usually much less than the number of states (unless there are very few states).

Nonetheless, pathological worst cases still exist because the NFA⇒DFA transformation can exponentially increase the number of states. The exponential blowup can be avoided by using a collection of NFAs instead of DFAs. It's convenient to simplify the NFA transitions by using ε-free NFAs, either by doing an ε-closure on the Thompson automaton or by building a Glushkov automaton. Using the Glushkov automaton guarantees that the number of states is no greater than the length of the pattern.

In pseudocode:

Initialise a vector <v> of <index, state> pairs. Initially the vector
is empty, and its maximum size is the number of states. This vector is
always kept in increasing order by index.

Initialise another vector <active> of string indices, one for each state.
Initially all the elements in <active> are Undefined. This vector records
the most recent index at which some Automaton was in each state.

Initialise <match> to a pair of index positions, both undefined. This
will record the best match found so far.

For each position <scan> in the string:
    If <match> has not yet been set:
        Append {<scan>, <start_state>} to <v>.
    If <v> is empty:
        Return <match> if it has been set, or otherwise
        return a failure indication.
    Copy <v> to <v'> and empty <v>. (It's possible to recycle <v>,
    but it's easier to write the pseudocode with a copy.) 
    For each pair {<i>, <q>} in <v'>:
        If <i> is greater than the starting index in <match>:
            Terminate this loop and continue with the outer loop.
        If there is no transition from state <q> on the symbol at <scan>:
            Continue with the next pair.
        Otherwise, there is a transition to <q'> (if using NFAs, do this for each transition):
            If the index in <active> corresponding to <q'> has already
            been set to <scan>:
                Continue with the next pair.
            Otherwise, <q'> is not yet in <v>:
                Append the pair {<i>, <q'>} at the end of <v>.
                Set the the index in <active> at <q'> to <scan>.
                If <q'> is an accepting state:
                     Set <match> to {<i>, <scan>}.

Upvotes: 1

Related Questions