MathematicalOrchid

Reputation: 62818

Merging duplicate path nodes

Consider the following trivial data structure:

data Step =
  Match Char |
  Options [Pattern]

type Pattern = [Step]

This is used together with a small function:

match :: Pattern -> String -> Bool
match [] _ = True
match _  "" = False
match (s:ss) (c:cs) =
  case s of
    Match   c0 -> (c == c0) && (match ss cs)
    Options ps -> any (\ p -> match (p ++ ss) (c:cs)) ps

It should be fairly obvious what is going on here: a Pattern either does or does not match a given String, based on the steps it contains. Each Step either matches a single character (Match) or consists of a list of possible sub-patterns (Options). (Note well: sub-patterns are not necessarily of equal length!)

Suppose we have a pattern such as this:

[
  Match '*',
  Options
    [
      [Match 'F', Match 'o', Match 'o'],
      [Match 'F', Match 'o', Match 'b']
    ],
  Match '*'
]

This pattern matches two possible strings, *Foo* and *Fob*. Clearly we can "optimise" this into

[Match '*', Match 'F', Match 'o', Options [[Match 'o'], [Match 'b']], Match '*']
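
As a quick sanity check, binding the two forms above to the hypothetical names pat and pat', both accept and reject the same strings:

pat, pat' :: Pattern
pat  = [ Match '*'
       , Options [ [Match 'F', Match 'o', Match 'o']
                 , [Match 'F', Match 'o', Match 'b'] ]
       , Match '*' ]
pat' = [ Match '*', Match 'F', Match 'o'
       , Options [[Match 'o'], [Match 'b']]
       , Match '*' ]

-- map (match pat)  ["*Foo*", "*Fob*", "*Foz*"]  ==>  [True, True, False]
-- map (match pat') ["*Foo*", "*Fob*", "*Foz*"]  ==>  [True, True, False]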

My question: How do I write the function to do this?

More generally, a given Options constructor may have an arbitrary number of sub-paths of wildly different lengths, some with common prefixes and suffixes, and some without. It's even possible to have empty sub-paths, or to do something like Options [] (which is of course a no-op). I'm struggling to write a function which will reduce every possible input correctly...

Upvotes: 3

Views: 306

Answers (1)

sclv

Reputation: 38893

On cursory inspection this looks like you've defined a nondeterministic finite state automaton (NFA). NFAs were first defined by Michael O. Rabin and, of all people, Dana Scott, who has brought us much else as well!

This is an automaton because it is built out of steps, with transitions between them, based on acceptance states. At each step you have many possible transitions. Hence your automaton is nondeterministic. Now you want to optimize this. One way to optimize it (not the way you're asking for, but related) is to eliminate backtracking. You can do this by taking every combination of how you get to a state along with the state itself. This is known as the powerset construction: http://en.wikipedia.org/wiki/Powerset_construction

The Wikipedia article is actually pretty good -- and in a language like Haskell we can first define the full powerset DFA, then lazily traverse all genuine paths to "strip out" most of the unreachable cruft. That gets us to a decent DFA, but not necessarily a minimal one.
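
To make that concrete, here is one way the subset construction might look, written against a toy NFA representation (Int states, an explicit transition map) rather than your Pattern type -- purely illustrative, not a drop-in for your code. Only subsets reachable from the start set ever get built, which is the "strip out the unreachable cruft" part:

import qualified Data.Map as Map
import qualified Data.Set as Set

-- A toy NFA/DFA representation, purely for illustration.
data NFA = NFA
  { nfaStart  :: Int
  , nfaTrans  :: Map.Map (Int, Char) (Set.Set Int)
  , nfaAccept :: Set.Set Int
  }

-- Each DFA state is a *set* of NFA states.
data DFA = DFA
  { dfaStart  :: Set.Set Int
  , dfaTrans  :: Map.Map (Set.Set Int, Char) (Set.Set Int)
  , dfaAccept :: Set.Set (Set.Set Int)
  }

-- Powerset (subset) construction, exploring only subsets reachable
-- from the start set.
subsetConstruct :: [Char] -> NFA -> DFA
subsetConstruct alphabet nfa = loop [start] Set.empty Map.empty
  where
    start = Set.singleton (nfaStart nfa)

    -- every NFA state reachable from any state in qs on character c
    step qs c = Set.unions
      [ Map.findWithDefault Set.empty (q, c) (nfaTrans nfa)
      | q <- Set.toList qs ]

    loop [] done trans = DFA start trans (Set.filter accepts done)
      where accepts qs = not (Set.null (qs `Set.intersection` nfaAccept nfa))
    loop (qs:rest) done trans
      | qs `Set.member` done = loop rest done trans      -- already expanded
      | otherwise =
          let targets = [ (c, step qs c) | c <- alphabet ]
          in  loop (map snd targets ++ rest)
                   (Set.insert qs done)
                   (foldr (\(c, t) -> Map.insert (qs, c) t) trans targets)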

As described at the bottom of that article, we can use Brzozowski's algorithm, flipping all the arrows and getting a new NFA that describes going from end to initial states. Now if we were minimizing a DFA, we'd need to go from there back to the DFA again, then flip the arrows and do it all again. This isn't necessarily the fastest approach, but it's straightforward and works well enough for plenty of cases. There are plenty of better algorithms available as well: http://en.wikipedia.org/wiki/DFA_minimization
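
In code the shape of that algorithm is just a pipeline. The helpers below are hypothetical stubs (reversal needs an NFA generalized to a set of start states, which the toy type above doesn't have), so this only shows the structure, not an implementation:

-- An NFA generalized to a set of start states (reversal needs one);
-- representation elided: this is only a shape sketch.
data GNFA

reverseDFA :: DFA -> GNFA
reverseDFA = undefined    -- flip every arrow; accepting states become the starts

determinize :: [Char] -> GNFA -> DFA
determinize = undefined   -- the subset construction again, as sketched above

-- reverse, determinize, reverse, determinize
brzozowski :: [Char] -> DFA -> DFA
brzozowski sigma = determinize sigma . reverseDFA
                 . determinize sigma . reverseDFA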

For minimizing an NFA, there are a variety of approaches, but the problem is in general NP-hard, so you'll have to pick some poison :-)

Of course, all this assumes you have a full NFA. If you have mutually recursive definitions, then you can put a Pattern "inside" itself -- and with your types, you sure can. That said, you'll then need to use clever tricks to recover the explicit shared structure in order to even begin working with an NFA in this form -- otherwise you'll loop forever.
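
For example (a hypothetical binding, not something from your code), laziness means a Pattern can literally contain itself:

-- A pattern that refers to itself. Matching against it still terminates,
-- because `any` stops at the first successful alternative, but any attempt
-- to walk the whole structure (printing it, or "optimising" it naively)
-- will loop forever.
loopy :: Pattern
loopy = [Match 'a', Options [[], loopy]]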

If you impose a "no sharing" rule -- i.e. the directed graph of your NFA is not only acyclic, but branches never 'merge back' except when you exit an Options set -- then I'd imagine that simplification is a much more straightforward affair: just 'factoring out' common characters. Since this involves thinking and not just providing references, I'll leave it there for now, noting only that this article might somehow be of interest: http://matt.might.net/articles/parsing-with-derivatives/

p.s.

A stab at the "factoring" solution starts with functions of roughly the following types:

factor :: [Pattern] -> (Maybe Step, [Pattern])
-- pulls out a common element of the pattern heads, should one exist; shallow

factorTail :: [Pattern] -> (Maybe Step, [Pattern])
-- the same, but pulling a common element out of the pattern tails

simplify :: [Pattern] -> [Pattern]
-- removes redundant constructs, such as options composed only of other
-- options (which can be flattened out), options with no elements that are
-- the "only" option, etc.; should run "deep", all levels down

Now you can start at the lowest level and cycle simplify and factor until no new factors come out. Then do the same with factorTail. Then go one level up and do the same thing. I wouldn't be shocked if you could "trick" this into a nonminimal result, but I think for most cases it will work very well.
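
Here's one possible way to flesh that out -- a sketch, not battle-tested code. It assumes the Step/Pattern types from your question plus a derived Eq instance on Step (i.e. add deriving (Eq) to the data declaration), and the hypothetical driver factorOptions only handles a single Options node, without the "go one level up" recursion:

-- Pulls out a common first step of every alternative, should one exist.
factor :: [Pattern] -> (Maybe Step, [Pattern])
factor ps
  | not (null ps)
  , all (not . null) ps               -- every alternative is non-empty...
  , (s:_) <- head ps
  , all ((== s) . head) ps            -- ...and they all start with the same step
  = (Just s, map tail ps)
  | otherwise
  = (Nothing, ps)

-- The same, but for a common last step.
factorTail :: [Pattern] -> (Maybe Step, [Pattern])
factorTail ps
  | not (null ps)
  , all (not . null) ps
  , s <- last (head ps)
  , all ((== s) . last) ps
  = (Just s, map init ps)
  | otherwise
  = (Nothing, ps)

-- A (partial) cleanup pass: inline single-alternative Options nodes and
-- recurse into nested ones.
simplify :: [Pattern] -> [Pattern]
simplify = map (concatMap simplifyStep)
  where
    simplifyStep (Options [p]) = concatMap simplifyStep p
    simplifyStep (Options qs)  = [Options (simplify qs)]
    simplifyStep s             = [s]

-- Hypothetical driver: strip common heads, then common tails, and wrap
-- whatever is left back up in an Options node.
factorOptions :: [Pattern] -> Pattern
factorOptions alts = heads ++ rebuild middle ++ reverse tails
  where
    (heads, afterHeads) = pullAll factor     (simplify alts)
    (tails, middle)     = pullAll factorTail afterHeads

    rebuild qs
      | all null qs = []              -- every alternative emptied out
    rebuild [q]     = q               -- a single alternative needs no Options
    rebuild qs      = [Options qs]

    -- keep applying f until no common step comes out
    pullAll f qs = case f qs of
      (Just s,  qs') -> let (ss, rest) = pullAll f qs' in (s : ss, rest)
      (Nothing, qs') -> ([], qs')

On your example, factorOptions [[Match 'F', Match 'o', Match 'o'], [Match 'F', Match 'o', Match 'b']] evaluates to [Match 'F', Match 'o', Options [[Match 'o'], [Match 'b']]], which is exactly the factored form you wanted.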

Update: What this solution doesn't address is a case like Options ["--DD--", "++DD++"] (reading the strings as lists of Match steps), where you have uncommon structure in both the head and the tail but common structure in the middle. A more general solution in such an instance would be to pull out the longest common substring of all the alternatives in your list, and use that as the "frame", with options inserted in the sections where things differ.
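
A quick sketch of that "frame" step (longestCommonInfix is a hypothetical helper, and a naive quadratic one at that; splitting each alternative around the frame is left out):

import Data.List (inits, isInfixOf, maximumBy, tails)
import Data.Ord  (comparing)

-- The longest list occurring as a contiguous segment of every alternative.
-- For the question's types this needs Eq Step.
longestCommonInfix :: Eq a => [[a]] -> [a]
longestCommonInfix []     = []
longestCommonInfix (p:ps) =
  maximumBy (comparing length)
    [ seg | seg <- segments p, all (seg `isInfixOf`) ps ]
  where
    -- all contiguous sublists, plus the empty one as a fallback
    segments xs = [] : [ s | t <- tails xs, s <- tail (inits t) ]

-- longestCommonInfix ["--DD--", "++DD++"]  ==>  "DD"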

Upvotes: 4
