Should tokenizer return language keywords?

Question

I am writing a toy compiler for a toy language; let's suppose it has JavaScript syntax.

Let's say that the source file is:

var val = 123;

My simple compiler will consist of a Tokenizer and a Parser (for now).

Should the Tokenizer return entire language keywords (e.g. var), or should it return them letter by letter (v, a, r)?

Sooner or later I will have to recognize keywords, literals, etc., and I wonder where this kind of work belongs.

olydis · Accepted Answer

The tokenizer should usually return entire keywords as single tokens.

There is no disadvantage to doing so. Once your tokenizer has decided that a run of characters is a language keyword (and not a number or similar), why "weaken" that information by splitting something it has already recognized back into parts? ;)
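As a rough illustration, here is a minimal tokenizer sketch in Python. The question does not say which language the compiler is written in, so Python is an assumption, and the names `KEYWORDS`, `TOKEN_RE` and `tokenize`, the token kinds, and the tiny keyword set are all made up for this example. It reads a whole identifier-shaped word first and only then checks whether that word is a keyword:

```python
import re

# Hypothetical keyword set for the toy language (an assumption, not a full list).
KEYWORDS = {"var", "function", "return", "if", "else"}

# One named group per token kind; the name of the group that matched
# becomes the token's kind.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<PUNCT>[=;])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(source):
    """Turn source text into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token itself
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"  # the whole word is reported as a single keyword token
        tokens.append((kind, text))
    return tokens


print(tokenize("var val = 123;"))
```

Because the word is matched as a whole before the keyword lookup, an identifier such as `variable` is not mistaken for the keyword `var` followed by `iable`.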

More generally: don't hesitate to let the tokenizer output building blocks that are as large as possible, as long as it does not attach any further meaning to them; that should be left to the parser.
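To illustrate that division of labor, here is a small parser sketch, again in Python and again under assumptions: the `(kind, text)` token shape, the function name `parse_var_declaration`, and the dictionary used as a syntax-tree node are all hypothetical. The token list is written out by hand as what a tokenizer might produce for `var val = 123;`. The tokens only say what each piece is; the parser is where "this is a variable declaration" gets decided:

```python
def parse_var_declaration(tokens):
    """Parse `var <name> = <number> ;` from a list of (kind, text) tokens."""

    def expect(index, kind, text=None):
        # Check that the token at `index` has the expected kind (and text, if given).
        wanted = text or kind
        if index >= len(tokens):
            raise SyntaxError(f"expected {wanted}, got end of input")
        tok_kind, tok_text = tokens[index]
        if tok_kind != kind or (text is not None and tok_text != text):
            raise SyntaxError(f"expected {wanted}, got {tok_text!r}")
        return tok_text

    expect(0, "KEYWORD", "var")
    name = expect(1, "IDENT")
    expect(2, "PUNCT", "=")
    value = expect(3, "NUMBER")
    expect(4, "PUNCT", ";")
    # Only here do the tokens acquire meaning as a declaration.
    return {"type": "VarDeclaration", "name": name, "value": int(value)}


# Hand-written token list, standing in for tokenizer output.
tokens = [
    ("KEYWORD", "var"),
    ("IDENT", "val"),
    ("PUNCT", "="),
    ("NUMBER", "123"),
    ("PUNCT", ";"),
]
ast = parse_var_declaration(tokens)
print(ast)
```

If the tokenizer handed over single letters instead, the parser would first have to reassemble `v`, `a`, `r` into a word before it could do any of this.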
