Reputation: 589
What is the basic approach used by perl, python, java, vim, etc. to implement parsing with regular expressions?
Not the clever formal-language approaches (i.e. NFA, DFA), nor parser combinators (e.g. the 14-line regex engine).
I've had a look at the source for java's implementation of perl-style regular expressions, but its sophisticated features (e.g. back-references) and efficiency measures (e.g. Boyer-Moore substring matching) made it difficult to see how it basically works.
EDIT
Various sources say "backtracking" is involved (e.g. Regular Expression Matching Can Be Simple And Fast; formal methods courses), but are not clear on exactly what is backtracked on... is it a way to evaluate an NFA? Can it be done from the AST of the regex directly?
What do java/perl/python regex engines actually do?
Is it something like: "a way to generate all possible words in the regular language, but abandon a specific word as soon as it doesn't match the input string".
Upvotes: 1
Views: 810
Reputation: 589
The various engines seem to first construct (or "compile") an NFA from the regular expression, then execute the NFA, following transitions from state to state, and backtracking to a previous state when a route fails. Please note: the backtracking is done on the NFA, not on the regular expression.
An NFA is a kind of automaton, with a graph of nodes (states) connected by directed arcs (transitions). An arc is labeled with a character; when that character is seen in the text, that arc is followed. An NFA is a Non-deterministic Finite Automaton - the "non-deterministic" part means that two arcs leaving a node can have the same label, so that both are followed simultaneously. There can also be epsilon transitions (ε, representing the empty string - no character), which are always followed without needing an input character.
Backtracking means that when we come to a choice, we checkpoint the state (where we are in the automaton and in the text). Then, we make a choice. If it doesn't work out, we revert or "backtrack" to that checkpoint, and try the next choice. Like navigating a cave, we might go deeper and deeper through a series of forks before we hit a dead-end, "nesting" the checkpoints.
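To make the checkpoint idea concrete, here is a hypothetical sketch (not how any real engine stores its state) of backtracking with an explicit checkpoint stack. The toy pattern a(b|c)d is represented as a list of alternative characters per position, so each position with more than one alternative is a choice point:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Checkpoints {
    // each element of the pattern is the set of characters allowed there: a(b|c)d
    static char[][] pattern = { {'a'}, {'b', 'c'}, {'d'} };

    static boolean match(String text) {
        // checkpoints: {patternPos, textPos, nextAlternativeToTry}
        Deque<int[]> stack = new ArrayDeque<>();
        int p = 0, t = 0, alt = 0;
        while (true) {
            if (p == pattern.length) return t == text.length();
            if (t < text.length() && alt < pattern[p].length
                    && text.charAt(t) == pattern[p][alt]) {
                if (alt + 1 < pattern[p].length)          // another arc leaves this node:
                    stack.push(new int[]{p, t, alt + 1}); // checkpoint before committing
                p++; t++; alt = 0;                        // follow the chosen arc
            } else if (alt + 1 < pattern[p].length) {
                alt++;                                    // try the next arc from this node
            } else if (!stack.isEmpty()) {
                int[] cp = stack.pop();                   // dead-end: backtrack to checkpoint
                p = cp[0]; t = cp[1]; alt = cp[2];
            } else {
                return false;                             // no checkpoints left: no match
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(match("abd")); // true
        System.out.println(match("acd")); // true (after backtracking past 'b')
        System.out.println(match("aed")); // false
    }
}
```

The recursive engines described below keep the same checkpoints implicitly, on the call stack, instead of in an explicit data structure.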
The process for constructing an NFA from a regular expression is called "Thompson's construction". For details, please see MJD's article or Regular Expression Matching Can Be Simple And Fast (about 1/5 of the way through, section "Converting Regular Expressions to NFAs"). There's also a wikipedia article, Thompson's construction.
Java's implementation (java/util/regex/Pattern.java) parses the regular expression by recursive descent, with methods expr() (alternation |), sequence(), atom(), and closure() (optional ? and star *), generating the NFA as Node objects with a next field for the transition to another Node. There are many, many different Node subclasses.
notes: (1) expr() calls sequence(), which returns with a so-called "double return" - it returns head explicitly, and tail via a global field root; in effect a (head, tail) tuple. (2) There is no Sequence object; instead, the objects in a sequence form a linked list through their next fields.
transitions: The NFA is traversed with match() methods. The match() method on one Node subclass calls the match() method on the next one, so that the path through the automaton is a chain of recursive calls. The last node is a LastNode object, which checks that the whole of the text has been used.
checkpointing: The arguments of match() include an index i into the text. It also has an implicit argument: the object it is called on. Together, this index and object represent the present state, and the match() invocation effectively checkpoints it, so we can backtrack to it later.
backtracking: The method match() returns a boolean, indicating whether the match succeeded or not. If true, the returns chain all the way back to the initial call (popping off all the checkpoints, as if quickly retracing our steps back). But if false (due to a character not matching; or running out of text; or reaching the LastNode while there's still text left), backtracking occurs.
The simplest backtracking is alternation. The code, in Branch.match (line 4599 of the above source), tries the first choice; if it fails, tries the second choice; and so on. If all choices fail, it returns false.
Although the java implementation may be the simplest of the engines, because it is the least optimized, it is still very complex! There are many PCRE features, unicode handling, major explicit optimizations (e.g. Boyer-Moore substring matching), and also many minor optimizations in the flow of the code itself, making it hard to learn from. Therefore, what follows is an ELI5 simplification, using only sequence and alternation (i.e. not even star or optional, so not actually regular expressions).
public class MyPattern {
    public static void main(String[] args) {
        // ab|ac // non-deterministic
        Node re =
            new Branch(
                new Sequence( new Single('a'), new Single('b') ),
                new Sequence( new Single('a'), new Single('c') )
            );
        re = new Sequence(re, new LastNode());
        System.out.println( re.match(0, "ac") );
    }

    abstract static class Node {
        abstract void setNext(Node next);
        abstract boolean match(int i, String s);
    }

    static class LastNode extends Node {
        void setNext(Node next) { throw new RuntimeException("don't call me"); }
        boolean match(int i, String s) {
            return i == s.length();
        }
    }

    static class Single extends Node {
        Node next;
        char ch;
        Single(char ch) { this.ch = ch; }
        void setNext(Node next) { this.next = next; }
        boolean match(int i, String s) {
            return i < s.length() && s.charAt(i) == ch && next.match(i+1, s);
        }
    }

    static class Branch extends Node {
        Node left, right;
        Branch(Node l, Node r) { this.left = l; this.right = r; }
        void setNext(Node next) {
            left.setNext(next);
            right.setNext(next);
        }
        boolean match(int i, String s) {
            return left.match(i, s) || right.match(i, s);
        }
    }

    static class Sequence extends Node {
        Node left, right;
        Sequence(Node l, Node r) {
            this.left = l;
            this.right = r;
            left.setNext(right);
        }
        void setNext(Node next) { right.setNext(next); }
        boolean match(int i, String s) {
            return left.match(i, s);
        }
    }
}
To be proven: the above code parses a subset of regular expressions (i.e. without star or optional). In the following "regular expression" has this restricted meaning.
First we'll prove the constructed NFA from a regular expression is equivalent; then, that the code constructs an equivalent NFA; and finally that backtracking parsing is equivalent to that.
We'll begin by proving the plain regular-expression-to-NFA conversion is correct.
define regular expression: The standard definition of the meaning of a regular expression is in terms of the set of words it generates (not what it parses). This set is called a language, and we say L(R) is the language of regular expression R.
It's very useful to be able to express every possible regular expression in the same way (as a language), because then we can ignore the details of the actual regular expression that generated it.
literal a: L(a) -> {a} - gives the set containing that single-letter word
alternation: L(A|B) -> L(A) U L(B) - whenever two regular expressions A and B are or-ed together, we can easily work out the language produced. Assuming we know the language of each of the expressions, both can be generated, so the resulting language is just the union of their languages.
sequence: L(AB) -> L(A) X L(B) - the X means all combinations of: a word in L(A) followed by a word in L(B).
The language of a regular expression can be built up by finding the languages of the operations forming it (following an abstract syntax tree).
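For example, building up the language of ab|ac rule by rule:

    L(ab|ac) = L(ab) U L(ac)                    (alternation)
             = (L(a) X L(b)) U (L(a) X L(c))    (sequence)
             = ({a} X {b}) U ({a} X {c})        (literal)
             = {ab} U {ac}
             = {ab, ac}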
to NFA: The NFA of regular expression A is notated as N(A). A word generated is the sequence of characters encountered by following a path of transitions through an NFA from the start node to the exit node. Its language L(N(A)) is the set of all such words it can generate, by following all paths.
Defining it as a language enables us to compare NFAs with regular expressions.
sequence: The NFA for AB is constructed by putting N(A) and N(B) in sequence: the last node of N(A) and the first node of N(B) become the same node:
N(A) N(B)
O------->O------->O
The possible paths through this are: from the start node (on the left), any path can be taken through N(A) to the middle node; then, any path can be taken through N(B) to the exit node on the right. This produces the same language as L(AB) = L(A) X L(B)
(We assume that A and B have already been done correctly i.e. L(N(A))=L(A) and L(N(B))=L(B)).
alternation: The NFA for A|B is constructed by combining the start nodes of N(A) and N(B), and the exit nodes of N(A) and N(B):
_______
/ N(A) \
O >O
\_______/
N(B)
This combines the possible paths, just as L(A|B) -> L(A) U L(B), and so the same language is produced: L(N(A|B)) = L(A|B).
(Again, we assume that A and B have already been done correctly, L(N(A))=L(A) and L(N(B))=L(B)).
literal a: The NFA for some character a is simply a transition from start to exit node labeled with a:
a
O-------->O
The set of words produced by all possible paths through this NFA is simply {a} - the set of just the one, single-letter word a. This is the same as for the regular expression a, i.e. L(N(a)) = {a} = L(a).
In summary, as we build up the NFA from the regular expression, operator by operator, the language of the NFA is the same as the regular expression at each step of the way.
To prove that the code above constructs the NFA, we'll first prove that the method setNext() sets all the exit arcs of the NFA.
lemma: setNext() sets all exit arcs: A Node object represents a node in an NFA. The exit arcs are all transitions to the NFA's exit node. In the code, these are represented by the next field referencing another node. We want to show that calling the method setNext() will set all such exit arcs.
For a Sequence object (representing N(AB)), calling right.setNext() sets all its exit arcs, assuming the setNext() method of the component node right (representing N(B)) works correctly.
For a Branch object (representing N(A|B)), calling both left.setNext() and right.setNext() sets all its exit arcs, again assuming those methods work correctly (on the left and right nodes, representing N(A) and N(B)).
a: The exit arc of NFA N(a) is simply the one arc labeled a. For a Single object (representing N(a)), setting the field next sets its exit arc.
Combining the above three cases, calling setNext() on a Node object will set all exit arcs.
Proof: code constructs NFA: To prove this without the complication of parsing and backtracking, we won't use match() yet, but add a new method gen() that prints out the language of the NFA.
In the code, a transition from one Node object to another is represented by the first calling a method on the second. The transfer of control to another Node object represents the transition to another node.
sequence: In the Sequence constructor, the exit arcs of the left node are connected to the right node by calling left.setNext(right). This is where setNext() calls originate, because a sequence is the only place where the concept of "following" is specified.
note: Sequence acts as scaffolding to assemble the NFA, and after passing control to the left node it has no further role.
alternation: In a Branch object, the entry arcs are connected to both the left and right nodes: gen() calls both left.gen() and right.gen().
literal: In a Single node, control is transitioned to the next node with next.gen().
After the expression is constructed, all exit arcs are connected to a LastNode object by adding it in a sequence: new Sequence(re, new LastNode()).
Here are the gen() methods. They record the literal characters encountered along the path of transitions, which are printed when control reaches the LastNode.
abstract Node:
abstract void gen(String path);
LastNode:
void gen(String path) { System.out.println(path); }
Single:
void gen(String path) { next.gen(path+ch); }
Branch:
void gen(String path) { left.gen(path); right.gen(path); }
Sequence:
void gen(String path) { left.gen(path); }
Thus, calling gen("") on a Node object will print out its language.
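Assembling those fragments into a complete class (a sketch that mirrors the MyPattern example above, with match() replaced by gen()), the language of ab|ac is printed out:

```java
public class MyGen {
    public static void main(String[] args) {
        // ab|ac -- the same example NFA as before
        Node re = new Branch(
                new Sequence(new Single('a'), new Single('b')),
                new Sequence(new Single('a'), new Single('c')));
        re = new Sequence(re, new LastNode());
        re.gen(""); // prints the language, one word per line
    }

    abstract static class Node {
        abstract void setNext(Node next);
        abstract void gen(String path);
    }

    static class LastNode extends Node {
        void setNext(Node next) { throw new RuntimeException("don't call me"); }
        void gen(String path) { System.out.println(path); } // a complete word
    }

    static class Single extends Node {
        Node next;
        char ch;
        Single(char ch) { this.ch = ch; }
        void setNext(Node next) { this.next = next; }
        void gen(String path) { next.gen(path + ch); } // record the character
    }

    static class Branch extends Node {
        Node left, right;
        Branch(Node l, Node r) { left = l; right = r; }
        void setNext(Node next) { left.setNext(next); right.setNext(next); }
        void gen(String path) { left.gen(path); right.gen(path); } // both arcs
    }

    static class Sequence extends Node {
        Node left, right;
        Sequence(Node l, Node r) { left = l; right = r; left.setNext(right); }
        void setNext(Node next) { right.setNext(next); }
        void gen(String path) { left.gen(path); }
    }
}
```

Running this prints ab and then ac - exactly L(ab|ac) = {ab, ac} as derived above.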
In a sense, the method gen() above always "backtracks" at Branch, because it first recurses into the left node, and when that returns, the state is restored: we are back at the same Branch object, which represents the NFA node, and the word generated so far is also what it was before the left recursion. When it recurses into the right node, it is as if for the first time.
The change for "backtracking" in the method match is simply to not continue recursing if the sought word is already generated. This does not change which words are accepted.
The change for parsing is to check the word character by character, and to abandon an avenue when a character does not match, instead of continuing all the way to the LastNode. One way for a character to "not match" is to not exist i.e. when the text is too short. This is the i < s.length() check in Single.match().
Conversely, there could be too many characters in the text. This is addressed by LastNode.match() checking that all characters in the text have been read, with i==s.length().
Upvotes: 1
Reputation: 39158
The regexes in Perl 2 were taken from Henry Spencer. His regexp.c was only two thousand lines; it's not as difficult as later versions with more features.
Upvotes: 3
Reputation: 57600
There are two general approaches in regular expression engines.
Regular expressions can be transformed into finite automata. This relationship is well-studied in computer science. These finite automata can then be executed efficiently, and even run backwards. They provide strong guarantees, such as running in linear time and constant space with regards to the input string, but creating the finite automaton from the regex can be expensive. This approach also restricts the engine to real regular expressions, i.e. precludes advanced features such as back-references or recursion.
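A minimal hand-built sketch of this first approach: a DFA for the toy expression ab|ac (the state numbering and transition table here are my own illustration, not taken from any real engine). Execution is one table lookup per input character, so it is linear time regardless of the pattern:

```java
public class Dfa {
    // DFA for ab|ac over the alphabet {a, b, c}:
    // state 0 --a--> 1; state 1 --b--> 2; state 1 --c--> 2; state 2 accepts.
    // -1 is the dead state: no possible continuation can ever match.
    static int[][] delta = {
        { 1, -1, -1},  // from state 0 on a, b, c
        {-1,  2,  2},  // from state 1
        {-1, -1, -1},  // from state 2
    };

    static boolean match(String text) {
        int state = 0;
        for (char ch : text.toCharArray()) {
            if (ch < 'a' || ch > 'c') return false; // outside the alphabet
            state = delta[state][ch - 'a'];
            if (state < 0) return false;            // dead state: reject at once
        }
        return state == 2; // accept only if we ended in the accepting state
    }

    public static void main(String[] args) {
        System.out.println(match("ab")); // true
        System.out.println(match("ac")); // true
        System.out.println(match("ad")); // false
        System.out.println(match("a"));  // false
    }
}
```

Note there is no backtracking anywhere: the DFA is deterministic, so each character forces exactly one transition.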
Regular expressions can be interpreted by a backtracking engine. If an alternative in the pattern fails, this can back-track to the last decision point and try a different approach. This is extremely flexible and (with extra features like recursion + named subpatterns) can parse a much larger class of formal languages (formally, something like the LL(*) set of grammars). This is very similar to PEG parsers. The big drawback: due to the backtracking, running the regex can take exponential time – even without any extra advanced features.
On top of that, regex engines feature extra optimizations such as first searching for constant substrings in the pattern, since that is more efficient than running any kind of regex (and may even be able to use vectorized CPU instructions). If there is a choice point between multiple constant strings, those can be compiled very easily into a trie data structure (effectively, a simple finite automaton). This can reduce the amount of backtracking.
A regex that demonstrates the difference between finite automata and backtracking is the a*a*a*a*a*b pattern on the string aaaaaaaaaaaaaaacb. A finite automaton can easily see that this pattern will not match because of the c in the input. But a backtracking engine now has many, many decision points where it can try different lengths for each a* subpattern. Regex engines like Perl's or the re module in Python go exponential in this case, i.e. take very long to complete – add more a's to the input to make it take longer. This allows for interesting denial-of-service attacks if untrusted users can supply arbitrary regexes. For untrusted input, only finite-automata based regex engines should be used, e.g. RE2 from Google.
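Java's java.util.regex is also a backtracking engine, so the same pattern can be tried directly; a small sketch (the rejection is correct, but the work done grows quickly as the run of a's gets longer, since every split of the run across the five a* subpatterns is explored):

```java
import java.util.regex.Pattern;

public class Backtrack {
    public static void main(String[] args) {
        // A DFA would reject this in one linear scan on seeing the 'c';
        // a backtracking engine explores many splits of the a-run first.
        Pattern p = Pattern.compile("a*a*a*a*a*b");
        System.out.println(p.matcher("aaaaaaaaaaaaaaacb").matches()); // false
    }
}
```

Lengthening the string of a's (and adding more a* groups to the pattern) makes the time to print false grow dramatically, while the answer never changes.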
Upvotes: 7