Reputation: 3237
I've been reading the dragon book and was really interested in the algorithm that converts a regex directly to a DFA (without an explicit NFA). Suppose my lexer specification file is laid out like lex:
...
%%
if { return IF_TOK; }
else { return ELSE_TOK; }
then { return THEN_TOK; }
[a-zA-Z][a-zA-Z0-9]* { return IDENT; }
...more tokens
%%
...
I've implemented the algorithm and was wondering how I could extract tokens from it. I know that when an NFA is constructed with Thompson's construction (the Dragon book's variation) and then converted to a DFA, you can recover the tokens from the NFA states each DFA state is composed of, but I am unsure in this case.
Upvotes: 0
Views: 290
Reputation: 241901
A (f)lex description -- indeed, any lexical description -- is a collection of regular expressions. However, all practical algorithms match a single regular expression. This isn't a problem because we can simply construct a single regular expression consisting of an alternation between all the possible tokens. So the first four patterns in your example would be combined into
(if|then|else|[a-zA-Z][a-zA-Z0-9]*)
The direct regex->DFA algorithm given in the Dragon book works on an augmented regular expression, constructed by adding a single end-of-input marker at the end of the regex. States which include the position of this augmented marker are accepting states.
That's convenient for the description of an algorithm, but not very helpful for the construction of an actual tokeniser, because the tokeniser rarely reaches the end of the input. It just keeps going until no more transitions are possible, at which point it backs up to the last accepting state.
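That back-up-to-the-last-accept loop can be sketched in a few lines of Python. The transition table below is a hypothetical toy fragment covering only the `if` rule (a real table would be generated from the full combined pattern); it is meant only to show the shape of the maximal-munch loop:

```python
# Toy DFA fragment, NOT generated from the real pattern. MOVE[state][char]
# gives the next state; ACCEPT maps accepting states to rule actions.
MOVE = {
    0: {'i': 1, 'f': 4},   # start state
    1: {'f': 2},           # saw "i"
    2: {},                 # saw "if" (accepting)
    4: {},                 # saw "f" alone (dead end in this toy fragment)
}
ACCEPT = {2: 'IF_TOK'}

def next_token(text, pos):
    """Return (rule, end_offset) for the longest match starting at pos,
    or None if no rule matches. Remembers the last accepting state seen
    and "backs up" to it when no further transition is possible."""
    state, last = 0, None
    i = pos
    while i < len(text) and text[i] in MOVE.get(state, {}):
        state = MOVE[state][text[i]]
        i += 1
        if state in ACCEPT:
            last = (ACCEPT[state], i)   # record the last accept
    return last                         # effectively backs up to it
```

For example, `next_token("if(", 0)` keeps scanning until `(` has no transition, then reports the match that ended at the last accepting state.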
States in the regex->DFA algorithm correspond to sets of positions in the original regular expression. (If you've implemented the algorithm, you know what a position is, but for the sake of other people reading along, a position is roughly speaking the offset in the regular expression of the character literal (or wildcard character) which has just matched. Not all offsets in the regular expression are positions in this sense; parentheses and operators do not correspond to any matched character, for example. And an entire character class literal is a single position. But it's often easier to think about positions as though they were simply byte offsets in the regular expression itself.)
That's useful, because we can see instantly from the name of the state where in the regular expression the scan has reached. And since each pattern in the combined regular expression has a different range of positions, we know which rule we're in. If the pattern has succeeded, the rule number tells us which action to choose. If the accepting state is in more than one rule, we just choose the rule with the smallest position, because that's the (f)lex algorithm for selecting between two longest matches of the same length: choose the first one in the scanner definition.
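To make that concrete, here is a small Python sketch. The position numbers below are an assumed left-to-right numbering of the combined pattern `(if|then|else|[a-zA-Z][a-zA-Z0-9]*)` (one position per literal or character class); they are illustrative, not mandated by the algorithm:

```python
# Assumed numbering: i=1, f=2 | t=3, h=4, e=5, n=6 | e=7, l=8, s=9, e=10
# | [a-zA-Z]=11, [a-zA-Z0-9]=12. Each rule owns a contiguous range.
RULE_RANGES = [
    ('IF_TOK',   range(1, 3)),
    ('THEN_TOK', range(3, 7)),
    ('ELSE_TOK', range(7, 11)),
    ('IDENT',    range(11, 13)),
]

def rule_for_positions(positions):
    """Given the set of positions naming a DFA state, pick the rule that
    owns the smallest position -- i.e. the earliest rule in the scanner
    definition, matching (f)lex's tie-breaking rule."""
    p = min(positions)
    for name, rng in RULE_RANGES:
        if p in rng:
            return name
    return None
```

So a state whose name contains positions from both the `else` rule and the `IDENT` rule resolves to `ELSE_TOK`, because `else` appears first in the definition.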
But, as we noted earlier, the only accepting state is the one corresponding to the end-of-input marker at the end of the augmented pattern, and that is not in any of the original patterns. So if we want to identify the actual pattern, we need to augment each individual rule with its own end marker, rather than having all the rules share a single end marker. Furthermore, the end marker cannot be restricted to just matching the (fictional) character at the end of the input, because we're interested in the longest matching prefix rather than a complete match. So we have to consider the end marker as matching any single character, including the fictional end-of-input character. (That makes it a wilder wildcard than `.`, since `.` only matches real characters, not fictional end-of-input characters, and also -- in the real (f)lex implementation -- only real characters which happen not to be newline characters. But in principle, it's a kind of wildcard character, except that it doesn't absorb the character it matches.)
So the end result is that we transform the collection of patterns into:
(if#|then#|else#|[a-zA-Z][a-zA-Z0-9]*#)
where `#` is the end-of-match wildcard (or, more accurately, zero-length assertion which always succeeds) as described above. Then, every state which includes one (or more) of the `#` wildcards is an accepting state. The position it corresponds to is the smallest position in its name which corresponds to a `#`, and the rule action which should be taken is the one whose augmented pattern includes that `#`.
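Putting the last step into code: assume the augmented pattern `(if#|then#|else#|[a-zA-Z][a-zA-Z0-9]*#)` is numbered left-to-right, so the four `#` markers land at positions 3, 8, 13 and 16 (again an illustrative numbering, not the only possible one). A sketch of the accepting-state test then looks like:

```python
# Assumed marker positions for (if#|then#|else#|[a-zA-Z][a-zA-Z0-9]*#):
# i=1, f=2, #=3 | t=4..n=7, #=8 | e=9..e=12, #=13 | class=14, class=15, #=16
END_MARKERS = {3: 'IF_TOK', 8: 'THEN_TOK', 13: 'ELSE_TOK', 16: 'IDENT'}

def action_for_state(positions):
    """A state is accepting iff its position set contains at least one
    end marker; the action is the rule of the smallest marker present
    (earliest rule wins, as in (f)lex)."""
    markers = sorted(p for p in positions if p in END_MARKERS)
    return END_MARKERS[markers[0]] if markers else None
```

After scanning "if", for instance, the state contains both the `if` rule's marker (position 3) and the `IDENT` rule's marker (position 16, since more identifier characters could still follow); the smallest marker wins, so the action is `IF_TOK`.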
Upvotes: 3