How to parse an input string for matches given a DFA

Question

I'm implementing a regular expression parser from scratch by generating a NFA from a regular expression and then a DFA from the NFA. The problem is DFA only can say when a computation is accepting. If the regex is "n*" and the string to match is "cannot" the DFA will go to a failed state after it sees the c, so then I drop the first character from the front, "annot" then "nnot". At this point it sees the n and goes to a final state and would just return the single n so I told it to keep trying until the next character would take it off a final state. however when it finishes it removes the first character again so it would be "not" and it would match the "n" but I don't want the subsequent matches, I only want "nn". I don't know how this would be possible.

rici · Accepted Answer

Here's a simple but possibly not optimal algorithm. We try an anchored match at each successive point in the string by running the DFA starting at that point. When the DFA is being run, we record the last point in the string where the DFA was in an accepting state. When we eventually either reach the end of the string or a point at which the DFA can no longer advance, we can return a match if we passed through an accepting state; in other words, if we saved an accepting position, which will be the end of the match. Otherwise, we backtrack to the next starting position and continue.

(Note: in both pseudocode algorithms below, it's assumed that a variable which holds a string index can have an Undefined value. In a practical implementation, this value could be, for example, -1.)

In pseudocode:

Set  to 0
Repeat A:
     Initialise the DFA state the starting state.
     Set  to 
     Set  to Undefined
     Repeat B:
        If there is a character at  and
        the DFA has a transition on that character:
            Advance the DFA to the indicated next state
            Increment 
            If the DFA is now in an accepting state, set  to 
            Continue Loop B
        Otherwise, the DFA cannot advance:
            If  is still Undefined:
                Increment  and continue Loop A
            Otherwise,  has a value:
                Return a match from  to  (semi-inclusive)

The problem with the above algorithm is that Loop B could execute an arbitrary number of times before failing and backtracking to the next starting position. So in the worst case, the search time would be quadratic in the string length. That would happen, for example, with pattern a*b and a string consisting of a large number of as.

An alternative is to run several DFAs in parallel. Each DFA corresponds to a different starting point in the pattern. We scan the string linearly; at each position, we can spawn a new DFA corresponding to that position, whose state is the initial state.

It's important to note that not every starting point has a DFA, because it is not necessary to keep two DFAs with the same state. Since the search is for the first match in the string, if two DFAs share the same state, only the one with the earlier starting point is a plausible match. Furthermore, once some DFA reaches an accepting state, it is no longer necessary to retain any DFA whose starting point comes later, which means that as soon as an accepting state is reached by any DFA, we no longer add new DFAs in the scan.

Since the number of active DFAs is at most the number of states in the DFA, this algorithm runs in O(NM) where N is the length of the string and M is the number of states in the DFA. In practice, the number of active DFAs is usually much less than the number of states (unless there are very few states).

Nonetheless, pathological worst cases still exist because the NFA⇒DFA transformation can exponentially increase the number of states. The exponential blowup can be avoided by using a collection of NFAs instead of DFAs. It's convenient to simplify the NFA transitions by using ε-free NFAs, either by doing an ε-closure on the Thompson automaton or by building a Glushkov automaton. Using the Glushkov automaton guarantees that the number of states is no greater than the length of the pattern.

In pseudocode:

Initialise a vector  of  pairs. Initially the vector
is empty, and its maximum size is the number of states. This vector is
always kept in increasing order by index.

Initialise another vector  of string indices, one for each state.
Initially all the elements in  are Undefined. This vector records
the most recent index at which some Automaton was in each state.

Initialise  to a pair of index positions, both undefined. This
will record the best match found so far.

For each position  in the string:
    If  has not yet been set:
        Append {, } to .
    If  is empty:
        Return  if it has been set, or otherwise
        return a failure indication.
    Copy  to  and empty . (It's possible to recycle ,
    but it's easier to write the pseudocode with a copy.) 
    For each pair {, } in :
        If  is greater than the starting index in :
            Terminate this loop and continue with the outer loop.
        If there is no transition from state  on the symbol at :
            Continue with the next pair.
        Otherwise, there is a transition to  (if using NFAs, do this for each transition):
            If the index in  corresponding to  has already
            been set to :
                Continue with the next pair.
            Otherwise,  is not yet in :
                Append the pair {, } at the end of .
                Set the the index in  at  to .
                If  is an accepting state:
                     Set  to {, }.

How to parse an input string for matches given a DFA

Answers (1)

Related Questions