Reputation: 241
I am writing a toy compiler for a toy language; let's suppose it has JavaScript syntax.
Let's say that the source file is:
var val = 123;
My simple compiler will consist of a Tokenizer and a Parser (for now).
Should the Tokenizer return entire language keywords, e.g. var, or letter by letter (v, a, r)?
Sooner or later I will have to recognize keywords, literals, etc., and I wonder which component should do this kind of work.
Upvotes: 2
Views: 157
Reputation: 3310
The tokenizer should usually already return entire keywords (= tokens).
There is no disadvantage in doing so: once your tokenizer has decided that something is a language keyword (and not a number or similar), why would you "weaken" that information by splitting what you already successfully detected back into parts? ;)
So, more generally: don't hesitate to let the tokenizer output building blocks that are as large as possible, as long as you do not give them any more meaning; that should be left to the parser.
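To illustrate why whole tokens make the parser's job easy, here is a minimal sketch in Python of a parser for the single statement from the question. The `Token` type, the kind names ("KEYWORD", "IDENT", "INT", "PUNCT") and the shape of the returned dictionary are all assumptions made up for this example, not part of any particular compiler.

```python
from typing import NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical categories: "KEYWORD", "IDENT", "INT", "PUNCT"
    text: str  # the exact source text of the token


def parse_var_declaration(tokens):
    """Parse `var <name> = <integer> ;` from a list of whole tokens."""
    pos = 0

    def expect(kind, text=None):
        # Consume the next token if it has the expected kind (and text).
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError(f"unexpected end of input, expected {kind}")
        tok = tokens[pos]
        if tok.kind != kind or (text is not None and tok.text != text):
            raise SyntaxError(f"expected {kind} {text or ''}, got {tok}")
        pos += 1
        return tok

    # The grammar rule reads almost like the statement itself, because
    # the tokenizer has already grouped characters into tokens.
    expect("KEYWORD", "var")
    name = expect("IDENT")
    expect("PUNCT", "=")
    value = expect("INT")
    expect("PUNCT", ";")
    return {"type": "VarDecl", "name": name.text, "value": int(value.text)}


tokens = [
    Token("KEYWORD", "var"),
    Token("IDENT", "val"),
    Token("PUNCT", "="),
    Token("INT", "123"),
    Token("PUNCT", ";"),
]
decl = parse_var_declaration(tokens)
print(decl)
```

If the tokenizer handed over single letters instead, the parser would first have to reassemble v, a, r into a word before it could apply any grammar rule, which is exactly the work the tokenizer has already done.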
Upvotes: 3
Reputation: 881423
The whole point of a tokeniser is to take your input stream (of characters) and give you tokens that you can use for grammatical analysis.
Hence you would expect the tokeniser to give you something along the lines of:
T_KEYWORD_VAR
T_VARIABLE(val)
T_KEYWORD_EQUALS
T_INTEGER(123)
T_KEYWORD_SEMICOLON
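A tokeniser producing that list could look roughly like the following Python sketch. It is only one way to do it: the regular-expression approach, the `KEYWORDS` set, the `tokenize` function and the (name, value) tuple shape are assumptions for illustration, and only the token names are taken from the list above.

```python
import re

# Assumption: the toy language has just this one keyword so far.
KEYWORDS = {"var"}

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("INTEGER", r"\d+"),
    ("WORD", r"[A-Za-z_]\w*"),  # keywords and variable names look alike
    ("EQUALS", r"="),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),           # whitespace is dropped
    ("MISMATCH", r"."),         # anything else is an error
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))


def tokenize(source):
    """Turn a source string into a list of (token_name, value) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(
                f"unexpected character {text!r} at offset {m.start()}"
            )
        if kind == "WORD":
            # Read the whole word first, then decide whether it is a keyword.
            if text in KEYWORDS:
                tokens.append((f"T_KEYWORD_{text.upper()}", None))
            else:
                tokens.append(("T_VARIABLE", text))
        elif kind == "INTEGER":
            tokens.append(("T_INTEGER", int(text)))
        else:
            tokens.append((f"T_KEYWORD_{kind}", None))
    return tokens


for name, value in tokenize("var val = 123;"):
    print(name if value is None else f"{name}({value})")
```

The keyword check happens after the whole word has been read, so a name such as variable is not mistaken for the keyword var followed by more letters.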
Upvotes: 4