Reputation: 40489
I am writing my first parser and have a few questions conerning the tokenizer.
Basically, my tokenizer exposes a nextToken()
function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:
<
, :=
, (
and the likeNow, do you think this makes sense?
Also, I am struggling with the NUMBER
token-type. Do you think it makes more sense to further split it up into a NUMBER
and a FLOAT
token-type? Without a FLOAT
token-type, I'd receive NUMBER
(eg 402), a SYMBOL
(.) followed by another NUMBER
(eg 203) if I were about to parse a float.
Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909
? Should it return the SYMBOL
-
first, followed by the NUMBER
909
or should it return a NUMBER
-909
right away?
Upvotes: 6
Views: 3304
Reputation: 26868
From my experience with actual lexers:
"hello world"
is a string literal, parse it as a single token. If -3.058e18
is a float literal, parse it as a single token as well. Lexers usually rely on regular expressions, which are expressive enough for all these things, and more. Of course, if the literals are complex enough you have to split them (e.g. the block literal in Smalltalk).Upvotes: 3
Reputation: 15669
It depends upon your target language.
The point behind a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER
when it sees a symbol that matches "[0-9]+". If it sees a non-integer number, such as "3.1415926" it will return NUMBER
.
NUMBER
. While you could handle that in your parser, if your lexer is doing an appropriate job of skipping whitespace and comments (since they aren't relevant to your parser) then you could end up incorrectly parsing things like "123 /* comment / . \n / other comment */ 456" as floating point numbers.
As for lexing "-[0-9]+" as a NUMBER
vs MINUS
NUMBER
again, that depends upon your target language, but I would usually go with MINUS
NUMBER
, otherwise you would end up lexing "A = 1-2-3-4" as SYMBOL
=
NUMBER
NUMBER
NUMBER
NUMBER
instead of SYMBOL
=
NUMBER
MINUS
NUMBER
MINUS
NUMBER
MINUS
NUMBER
.
While we're on the topic, I'd strongly recommend the book Language Implementation Patterns, by Terrance Parr, the author of ANTLR.
Upvotes: 4
Reputation: 132
I think that the answer to your question is strictly tied to the semantic of NUMBER. What a NUMBER should be? An always positive integer, a float...
I'd like to suggest you to lookup to the flex and yacc (aka lex & bison) tools of the U**x operating systems: these are powerful parsers and scanners generators that take a grammar and output a compilable and readily usable program.
Upvotes: 2
Reputation: 6776
It depends on how you are taking in tokens, if you are doing it character by character, then it might be a bit tricky, but if you are doing it word by word i.e.
int a = a + 2.0
then the tokens would be (discarding whitespace)
int
a
=
a
+
2.0
So you wouldn't run into the situation where you interpret the .
as at token but rather take the whole string in - which is where you can determine if it's a FLOAT or NUMBER or whatever you want.
Upvotes: 0
Reputation: 18998
You are best served by making your token types closely match your grammar's terminal symbols.
Without knowing the language/grammar, I expect you would be better served by having token types for "LESS_THAN", "LESS_THAN_OR_EQUAL" and also "FLOAT", "DOUBLE", "INTEGER", etc.
Upvotes: 4