Alexey
Alexey

Reputation: 4061

Which exactly part of parsing should be done by the lexical analyser?

Does there exist a formal definition of the purpose, or at a clear best practice of usage, of lexical analysis (lexer) during/before parsing?

I know that the purpose of a lexer is to transform a stream of characters to a stream of tokens, but can't it happen that in some (context-free) languages the intended notion of a "token" could nonetheless depend on the context and "tokens" could be hard to identify without complete parsing?

There seems to be nothing obviously wrong with having a lexer that transforms every input character into a token and lets the parser do the rest. But would it be acceptable to have a lexer that differentiates, for example, between a "unary minus" and a usual binary minus, instead of leaving this to the parser?

Are there any precise rules to follow when deciding what shall be done by the lexer and what shall be left to the parser?

Upvotes: 0

Views: 249

Answers (1)

rici
rici

Reputation: 241691

Does there exist a formal definition of the purpose [of a lexical analyzer]?

No. Lexical analyzers are part of the world of practical programming, for which formal models are useful but not definitive. A program which purports to do something should do that thing, of course, but "lexically analyze my programming language" is not a sufficiently precise requirements statement.

… or a clear best practice of usage

As above, the lexical analyzer should do what it purports to do. It should also not attempt to do anything else. Code duplication should be avoided. Ideally, the code should be verifiable.

These best practices motivate the use of a mature and well-document scanner framework whose input language doubles as a description of the lexical grammar being analyzed. However, practical considerations based on the idiosyncracies of particular programming languages normally result in deviations from this ideal.

There seems to be nothing obviously wrong with having a lexer that transforms every input character into a token…

In that case, the lexical analyzer would be redundant; the parser could simply use the input stream as is. This is called "scannerless parsing", and it has its advocates. I'm not one of them, so I won't enter into a discussion of pros and cons. If you're interested, you could start with the Wikipedia article and follow its links. If this style fits your problem domain, go for it.

can't it happen that in some (context-free) languages the intended notion of a "token" could nonetheless depend on the context?

Sure. A classic example is found in EcmaScript regular expression "literals", which need to be lexically analyzed with a completely different scanner. EcmaScript 6 also defines string template literals, which require a separate scanning environment. This could motivate scannerless processing, but it can also be implemented with an LR(1) parser with lexical feedback, in which the reduce action of particular marker non-terminals causes a switch to a different scanner.

But would it be acceptable to have a lexer that differentiates, for example, between a "unary minus" and a usual binary minus, instead of leaving this to the parser?

Anything is acceptable if it works, but that particular example strikes me as not particular useful. LR (and even LL) expression parsers do not require any aid from the lexical scanner to show the context of a minus sign. (Naïve operator precedence grammars do require such assistance, but a more carefully thought out op-prec architecture wouldn't. However, the existence of LALR parser generators more or less obviates the need for op-prec parsers.)

Generally speaking, for the lexer to be able to identify syntactic context, it needs to duplicate the analysis being done by the parser, thus violating one of the basic best practices of code development ("don't duplicate functionality"). Nonetheless, it can occasionally be useful, so I wouldn't go so far as to advocate an absolute ban. For example, many parsers for yacc/bison-like production rules compensate for the fact that a naïve grammar is LALR(2) by specially marking ID tokens which are immediately followed by a colon.

Another example, again drawn from EcmaScript, is efficient handling of automatic semicolon insertion (ASI), which can be done using a lookup table whose keys are 2-tuples of consecutive tokens. Similarly, Python's whitespace-aware syntax is conveniently handled by assistance from the lexical scanner, which must be able to understand when indentation is relevant (not inside parentheses or braces, for example).

Upvotes: 1

Related Questions