Reputation: 648
For handling operator symbols and keywords like +, *, if, and import, should they be written as lexer rules, or should they be written directly into the grammar? Is there any significant performance difference either way? Does it depend on the situation? What should I consider when deciding what to run through the lexer and what to leave for parse time?
Upvotes: 2
Views: 733
Reputation: 53502
A lexer and a parser, even though they use the same underlying approach, serve different purposes. A lexer classifies character input: it scans a sequence of characters and assigns it a number (a token type). The lexer can skip or ignore certain character sequences and can use advanced text-matching features like character ranges and Unicode character classes. It is also possible to fine-tune the matching process through the order in which the token rules are defined.
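To make that concrete, here is a minimal lexer sketch (this assumes ANTLR4, which the talk of implicit tokens and listeners below suggests; the grammar name is made up):

    lexer grammar KeywordLexer;

    // Token rules are tried in definition order: "if" matches both IF
    // and ID at the same length, so IF wins because it is listed first,
    // while "iffy" is a longer match and still becomes an ID.
    IF : 'if' ;

    // Character ranges and Unicode character classes.
    ID : [\p{Alpha}_] [\p{Alnum}_]* ;

    // Skipped: these character sequences never reach the parser.
    WS : [ \t\r\n]+ -> skip ;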
None of that can be done in the parser, which consumes tokens to construct sentences that are part of the language described by the grammar. Even though you can define character literals in parser rules (these are then called implicit tokens), I don't recommend doing that. The reason is that you have no control over such tokens in any way: they are created with arbitrary names (so you have no means to check for them in a listener, for example) and they can conflict with other tokens in ways you cannot resolve.
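As a hypothetical illustration of the difference (two separate grammar files, all names made up):

    // Implicit.g4 -- 'if' in a parser rule becomes an implicit token
    // with a generated name like T__0; a listener cannot refer to it.
    grammar Implicit;
    stat : 'if' expr ;
    expr : ID ;
    ID   : [a-z]+ ;
    WS   : [ \t\r\n]+ -> skip ;

    // Explicit.g4 -- the same keyword as a named lexer rule; in the
    // Java target the generated code now exposes ExplicitParser.IF
    // and ctx.IF() for listeners and visitors.
    grammar Explicit;
    stat : IF expr ;
    expr : ID ;
    IF   : 'if' ;
    ID   : [a-z]+ ;
    WS   : [ \t\r\n]+ -> skip ;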
So, as a general rule: define character sequences that always belong together, separated by delimiters like whitespace, as single tokens in the lexer (keywords, strings, numbers, operators and the like are typical examples). Because delimiters (like the whitespace) can be hidden or even skipped and never reach the parser, it then becomes much easier in the parser to deal with the token sequence.
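A minimal sketch of that rule of thumb (again ANTLR4, all names made up for illustration):

    grammar Tiny;

    // The parser sees only meaningful tokens; whitespace never arrives.
    stat : ID ASSIGN expr SEMI ;
    expr : expr (MUL | DIV) expr
         | expr (ADD | SUB) expr
         | INT
         | ID
         ;

    // Operators and numbers as single, named tokens.
    ASSIGN : '=' ;
    SEMI   : ';' ;
    MUL    : '*' ;
    DIV    : '/' ;
    ADD    : '+' ;
    SUB    : '-' ;
    INT    : [0-9]+ ;
    ID     : [a-zA-Z_] [a-zA-Z0-9_]* ;

    // Hidden rather than skipped: invisible to the parser, but still
    // in the token stream for tools like formatters.
    WS     : [ \t\r\n]+ -> channel(HIDDEN) ;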
Upvotes: 3