Jas

Reputation: 870

GCC/Clang lexer and parser

I am curious about how C/C++ lexers and parsers work together. I know the parser generally needs a lookahead of at least one token. My question, though, is about production compilers (say gcc or clang):

1) Does the lexer run first, lexing the entire file, and then hand off to the parser to generate the AST? This would mean the lexer produces the complete list of tokens up front.

OR

2) Does the lexer generate only a small set of tokens at a time, just enough for the parser to do its job? This would mean the lexer and parser take turns running.

I definitely think option 1 is used, since languages like C++ sometimes need arbitrary lookahead (the grammar isn't context-free), but this would take a lot of memory.

Upvotes: 4

Views: 3167

Answers (1)

torek

Reputation: 487725

The traditional answer is close to your case 2, but not exactly that. Note that lexers and parsers are both typically implemented as relatively simple state machines.

The lexing state machine can be driven from either direction:

  • Get me a new token

(which obviously needs to get input codes and assemble them into tokens), or:

  • Here is a new input character

(which eventually causes tokens to "fall out" of the lexer).

The parser state machine can be driven from either direction:

  • Get me a parse

(which then has to obtain tokens until it finds a complete sentence), or:

  • Here is a new token

(which then has to assemble the tokens into a sentence).
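To make those two driving directions concrete, here's a minimal sketch in C++ of the two lexer shapes (a toy whitespace-separated token type and made-up names, nothing like gcc's or clang's real internals); the parser side has the same two shapes:

    #include <cctype>
    #include <functional>
    #include <istream>
    #include <string>
    #include <utility>

    // A toy token: just the characters that make it up.
    struct Token { std::string text; };

    // Pull style: "get me a new token". The lexer consumes input
    // characters until one whole token has been assembled.
    class PullLexer {
    public:
        explicit PullLexer(std::istream& in) : in_(in) {}
        bool next_token(Token& out) {
            out.text.clear();
            int c;
            while ((c = in_.get()) != EOF && std::isspace(c)) {}  // skip blanks
            for (; c != EOF && !std::isspace(c); c = in_.get())
                out.text.push_back(static_cast<char>(c));
            return !out.text.empty();
        }
    private:
        std::istream& in_;
    };

    // Push style: "here is a new input character". Completed tokens
    // "fall out" through a callback.
    class PushLexer {
    public:
        explicit PushLexer(std::function<void(const Token&)> on_token)
            : on_token_(std::move(on_token)) {}
        void feed(int c) {  // pass EOF to flush the final token
            if (c == EOF || std::isspace(c)) {
                if (!buf_.text.empty()) { on_token_(buf_); buf_.text.clear(); }
            } else {
                buf_.text.push_back(static_cast<char>(c));
            }
        }
    private:
        std::function<void(const Token&)> on_token_;
        Token buf_;
    };

The pull form blocks inside next_token() until it has assembled a whole token; the push form keeps partial-token state across feed() calls and fires the callback whenever a token completes.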

If the parser algorithms we use were driven this way, we'd "compile" a file with:

for all input characters:
    feed character to tokenizer

and as tokens "fall out" of the tokenizer, they'd drive the parser. The whole thing would be coroutines driven from the bottom up.
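With the hypothetical PushLexer from the sketch above, that bottom-up loop would look something like this (the "parser" here is just a stand-in that receives tokens as they fall out):

    #include <iostream>

    // A stand-in for the parser's "here is a new token" entry point.
    struct ToyParser {
        void consume(const Token& t) { std::cout << "token: " << t.text << '\n'; }
    };

    int main() {
        ToyParser parser;
        PushLexer lexer([&](const Token& t) { parser.consume(t); });
        int c;
        while ((c = std::cin.get()) != EOF)
            lexer.feed(c);   // feed character to tokenizer
        lexer.feed(EOF);     // flush the last token, if any
    }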

Traditionally, in the parsers generated by yacc, bison, and so on, and in the lexers that serve them, we're running more "top-down", i.e., someone calls a get me a sentence function (which may build an AST, or directly emit code, or something in between—e.g., build one AST for one function or declaration, then turn that into intermediate code, then build another AST for another function or declaration, etc). That drives everything in the direction of pulling tokens in from the lexer—but it's still rather coroutine-ish, as the parser just asks for one token at a time.
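Concretely, the yacc/bison contract is that the generated yyparse() calls a yylex() you supply (or that lex/flex generates) each time it wants one more token. Here's a sketch of the lexer side of that contract; it only means something linked against a real bison-generated parser (and assumes the bison output is compiled as C++ too), so treat it as the shape of the interface rather than a complete program:

    #include <cstdio>

    int yyparse();   // generated by bison from the grammar file

    // yyparse() calls this whenever it wants one more token. Returning 0
    // means end of input; in the classic yacc convention a single character
    // can serve as its own token code.
    int yylex() {
        int c = std::getchar();
        while (c == ' ' || c == '\t' || c == '\n')
            c = std::getchar();              // skip whitespace
        return (c == EOF) ? 0 : c;
    }

    int main() { return yyparse(); }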

This approach is also the obvious way to hand-code a recursive descent parser: your top function is "get me a sentence" (or "get me all sentences" or whatever) and eventually that leads to some function that calls "get me a token". So in both of these cases, the actual expression of the algorithm winds up making repeated "get me a token" calls to the lexer.
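For instance, a hand-coded recursive descent parser for something trivial, like sums of single digits, shows the call chain directly (the "tokens" are just single characters so the sketch stays short):

    #include <cctype>
    #include <cstdio>
    #include <cstdlib>

    // Hand-coded recursive descent for sums like "1+2+3".
    static int lookahead;                        // the one token of lookahead

    static void next_token() {                   // "get me a token"
        do { lookahead = std::getchar(); } while (lookahead == ' ');
    }

    static int parse_number() {
        if (!std::isdigit(lookahead)) {
            std::fprintf(stderr, "expected a digit\n");
            std::exit(1);
        }
        int value = lookahead - '0';
        next_token();
        return value;
    }

    static int parse_sum() {                     // sum := number ('+' number)*
        int value = parse_number();
        while (lookahead == '+') {
            next_token();                        // consume the '+'
            value += parse_number();
        }
        return value;
    }

    int main() {
        next_token();                            // prime the lookahead
        std::printf("%d\n", parse_sum());        // "get me a sentence"
    }

That single lookahead variable is exactly the one token of lookahead you mention in the question; the file is never tokenized as a whole.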

GCC has a hand-coded parser (and hand-coded lexer) that works in this way. I haven't looked at the innards of clang but I suspect it's the same.

As to C++ specifically, well, it has some very nasty parsing cases; see https://en.wikipedia.org/wiki/Most_vexing_parse and Pavel Minaev's answer to "Is there a good Python library that can parse C++?". Some compilers use ad-hoc methods to deal with this, e.g., provide an overly accepting grammar and try to back-patch the eventual AST, or "steer" the grammar via hacks. (I've seen C++ compilers crash here: feed them syntactically valid tokens that make semantic nonsense, and the hacks can misfire.) Another, arguably much better, method is to use a GLR parser; see Ira Baxter's answer here.
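To give one concrete taste of the nastiness, here's the standard most-vexing-parse example (essentially the one from that Wikipedia page):

    struct Timer {};

    struct TimeKeeper {
        explicit TimeKeeper(Timer) {}
        int get_time() const { return 0; }
    };

    int main() {
        // This looks like an object initialized with a temporary Timer, but
        // the grammar says it declares a *function* named time_keeper that
        // returns TimeKeeper and takes a pointer to a function returning Timer.
        TimeKeeper time_keeper(Timer());
        // time_keeper.get_time();   // error: time_keeper is not an object
    }

Both readings are locally plausible; C++ resolves the ambiguity by fiat (anything that can be a declaration is one), and disambiguating declarations from expressions in general is what pushes compilers toward the hacks or GLR parsing mentioned above.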

(I haven't done anything parser-theory-ish in ages, and in writing this answer came across sjoerd's comment about GLL parsers from 2011, which is pretty interesting.)

Upvotes: 6
