lerman
lerman

Reputation: 51

is there a C function for regex using a deterministic automaton?

The POSIX regex functions compile the regular expressions into non-deterministic finite automata (NFAs). One problem with that is there is no way to tell at compilation time whether those automata will use excessive stack space or take excessive cpu time. That makes them (in some sense) unsuitable for use in a real time system.

An equivalent deterministic finite automaton executes in linear time. It disadvantage is that it may use an excessive number of states, which translates to a large amount of program memory. On the plus side, though, is the fact that you know the number of states used at the time you compile the regular expression.

That means you can know at regular expression compile time whether it is suitable for your application. That brings me to my question: Is there a regular expression library for C that compiles to a DFA? The answer to a question that might be as useful would answer the question: Is there a regular expression library for C that gives useful information on memory and cpu utilization?

Ken

Upvotes: 3

Views: 318

Answers (1)

NinjaDarth
NinjaDarth

Reputation: 393

  1. Yes. 2. It's a matter of simple algebra. 3. Here

https://github.com/RockBrentwood/RegEx

(originally in the comp.compilers archive.)

Here an early description on comp.compilers, from which this ultimately descended.

https://compilers.iecc.com/comparch/article/93-05-083

and another later description

https://compilers.iecc.com/comparch/article/93-10-022

The older version of the RegEx C programs on GitHub may be found in the AI repository at Carnegie Mellon University here

https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/regex

I will try to retcon the 1993-2021 evolution-stream of it into the current GitHub snapshot so that you can have the whole history, rather than just the snapshot of the latest versions. (It would be nice if GitHub supported retconning and retrofitting history-streams, by the way.)

An automaton is little more than the graphic display of a finite right linear system of inequations. Every rational expression is the least fixed point solution to such a system, which can be unfolded from the expression purely by algebraic means.

This is a general result of Kleene algebra, so it goes well beyond just regular expressions; e.g. rational subsets of any monoid; a special case being rational subsets of product monoids, which includes rational transductions as a special case. And the algebraic method used in the C routines is mostly (but not entirely) generic to Kleene algebras.

I'm trying to adapt the calculation in {nfa,dfa}.c to handle both inputs and outputs. There are a few places where it makes specific assumption that the Kleene algebra is the free Kleene algebra ( = regular expression algebra). And that has to be modified, to allow it to be generalized to non-free Kleene algebras, like the rational transductions.

Regular expressions over an alphabet $X$ comprise the Kleene algebra $ℜX^*$ of the rational subsets of the free monoid $X^*$ generated by $X$. Correspondingly, $ℜX^*$ is the free Kleene algebra generated by $X$.

The underlying theory (with respect to which "free" refers to) can be 1st-order or 2nd order.

The 1st-order theory (notwithstanding Conway's "no finite axiomatization" result, mis-stated and mis-applied as a folklore theorem) is a finitely axiomatized algebra consisting of (a) the axioms for a semiring, with an idempotent sum $x + x = x$ (usually denoted $x | x$) ... i.e. a "dioid", and (b) the corresponding partial ordering relation defined by ($x ≥ y ⇔ (∃z) x = z + y ⇔ x = x + y$); (c) the Kleene star operator $x ↦ x^*$, which (d) provides least fixed point solutions $b^* a c^* = μx (x ≥ a + bx + xc)$. (A set of axioms to embody (d) are $x^* = 1 + x x^*$ and $x ≥ a + bx + xc ⇒ x ≥ b^* a c^*$.) That dates from the mid 1990's by Kozen.

The algebra presented by the 1st order theory is not closed under congruence relations (because, in fact, all computations can be represented by a Kleene algebra taken over a suitably defined non-trivial monoid; so the word problem isn't solvable either). The 2nd order formulation - which predates the 1st order formulation - is the closure of the 1st order formulation under congruence. It has (a) the axioms of a dioid and (b) the least fixed points of all rational subsets and (c) distributivity with respect to the rational least fixed point. The last two axioms can be narrowed and combined into a single axiom for the least fixed point: $μ_{n≥0}(ab^nc) = ab^*c$.

Using the terminology in LNCS 4988 (https://link.springer.com/chapter/10.1007%2F978-3-540-78913-0_14), this comprises the category of "rational dioids" = rationally closed idempotent semirings. It has a tensor product ⊗, which was part of the suite of additional infrastructure and expanded terminology laid out in LNCS11194 (pp. 21-36, 37-52) https://dblp.org/db/conf/RelMiCS/ramics2018.html.

The software requires and uses only the 1st order formulation.

Rational transductions over an input alphabet $X$ and output alphabet $Y$, similarly, comprise the rational subsets $ℜ(X^* × Y^*)$ of the product monoid $X^* × Y^*$; and in the rational-dioid category, the rational transduction algebra is the tensor product $ℜ(X^* × Y^*) = ℜX^* ⊗ ℜY^*$ of the respective regular expression algebras.

In turn, that algebra is effectively just the algebra of regular expressions over the disjoint union of $X$ and $Y$ modulo the commutativity rule $xy = yx$ for $x ∈ X, y ∈ Y$, so the process can be adapted and generalized adapted to:

(a) "transducers" - where both X and Y are present,

(b) "generators", where only $Y$ is present and $X = {1}$,

(c) "recognizers", where only $X$ is present and $Y = {1}$ and even

(d) generalizations of these where multiple input and/or output channels are allowed.

Example: the Kleene algebra $ℜX_0^* ⊗ ℜX_1^* ⊗ ℜY_0^* ⊗ ℜY_1^*$ would be for transducers with two input channels (one with alphabet $X_0$ the other with alphabet $X_1$) and two output channels (with respective alphabets $Y_0$ and $Y_1$).

Upvotes: 1

Related Questions