Reputation: 1492
I'm trying to understand the difference between "lexeme" and "token" in compilers.
If the lexer part of my compiler encounters the following sequence of characters in the source code to be compiled:
"abc"
is it correct to say that the above is a lexeme that is 5 characters long?
If my compiler is implemented in C, and I allocate space for a token for this lexeme, the token will be a struct. The first member of the struct will be an int holding a value from some enum, in this case STRING_LITERAL. The second member of the struct will be a char * that points to some (dynamically allocated) memory that has 4 bytes. The first byte is 'a', the second 'b', the third 'c', and the fourth is a NUL character ('\0') to terminate the string.
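A minimal sketch of the layout described above, to make the byte counts concrete. Only STRING_LITERAL comes from the question; the other enum members, the struct and member names, and the helper make_string_token are hypothetical, and escape sequences are ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds; only STRING_LITERAL appears in the question. */
enum token_type { STRING_LITERAL, INTEGER_LITERAL, IDENTIFIER };

struct token {
    int   type;  /* one of enum token_type */
    char *text;  /* heap-allocated, NUL-terminated payload */
};

/* Build a STRING_LITERAL token from a lexeme that still includes its
 * surrounding double quotes. lexeme_len counts the quotes, so "abc" is 5.
 * Escape sequences are not handled in this sketch.
 * Returns 0 on success, -1 on allocation failure. */
int make_string_token(const char *lexeme, size_t lexeme_len, struct token *out)
{
    assert(lexeme_len >= 2 && lexeme[0] == '"' && lexeme[lexeme_len - 1] == '"');

    size_t content_len = lexeme_len - 2;   /* drop both quotes */
    char *text = malloc(content_len + 1);  /* +1 for the terminating '\0' */
    if (text == NULL)
        return -1;

    memcpy(text, lexeme + 1, content_len);
    text[content_len] = '\0';

    out->type = STRING_LITERAL;
    out->text = text;
    return 0;
}
```

Calling make_string_token("\"abc\"", 5, &t) leaves t.text pointing at the 4 bytes 'a', 'b', 'c', '\0'. The struct itself occupies sizeof(struct token) bytes on top of that buffer, so the in-memory size of a token is not tied to the length of its lexeme.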
So...
The lexeme is 5 characters of the source code text.
The token is a total of 6 bytes in memory.
Is that the correct way to use the terminology?
(I'm ignoring tokens tracking metadata like filename, line number, and column number.)
Sort of related question:
Is it uncommon practice to have the lexer convert an integer lexeme into an integer value in a token? Or is it better (or more standard) to store the characters of the lexeme in a token and let the parser stage convert those characters to an integer node to be attached to the AST?
Upvotes: 3
Views: 3019
Reputation: 1403
A "lexeme" is the sequence of source characters that matches the pattern for one token; for example, the five characters "abc" (quotes included) form a single lexeme. The "lexer" or lexical analysis stage converts lexemes into tokens (such as keywords, identifiers, literals, operators, etc.), which are the smallest units the parser can use to create ASTs. So if we have the statement
int x = 0;
The lexer would output
<type:int> <id: x> <operator: = > <literal: 0> <semicolon>
The lexer is typically defined by a collection of regular expressions, each describing the character sequences that form one kind of terminal in the language's grammar. The matched sequences are turned into tokens, which are fed into the parser as a stream.
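As a rough illustration of that token stream, here is a small hand-written scanner in C that handles just enough to tokenize `int x = 0;`. It uses character tests instead of regular expressions, and every name in it (enum kind, struct tok, next_token) is made up for this sketch. Each token records where its lexeme starts in the source and how long it is.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum kind { TOK_TYPE, TOK_ID, TOK_OPERATOR, TOK_LITERAL, TOK_SEMICOLON, TOK_END, TOK_ERROR };

struct tok {
    enum kind   kind;
    const char *start;  /* first character of the lexeme in the source */
    size_t      len;    /* lexeme length in characters */
};

/* Scan one token starting at *src and advance *src past its lexeme. */
struct tok next_token(const char **src)
{
    const char *p = *src;
    while (isspace((unsigned char)*p))
        p++;

    struct tok t = { TOK_END, p, 0 };
    if (*p == '\0') {
        *src = p;
        return t;
    }

    if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.len = (size_t)(p - t.start);
        /* "int" is the only type keyword this sketch knows. */
        t.kind = (t.len == 3 && strncmp(t.start, "int", 3) == 0) ? TOK_TYPE : TOK_ID;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        t.len = (size_t)(p - t.start);
        t.kind = TOK_LITERAL;
    } else {
        /* Single-character tokens; anything unrecognized is an error. */
        t.kind = (*p == '=') ? TOK_OPERATOR : (*p == ';') ? TOK_SEMICOLON : TOK_ERROR;
        t.len = 1;
        p++;
    }
    *src = p;
    return t;
}
```

Calling next_token repeatedly on "int x = 0;" yields TOK_TYPE, TOK_ID, TOK_OPERATOR, TOK_LITERAL, TOK_SEMICOLON, and then TOK_END, matching the stream shown above.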
However, most people use lexeme and token interchangeably, and it usually doesn't cause confusion. For your question about converting the int literal, you would want a wrapper class for your AST. Just having an integer alone might not be enough information.
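Whichever stage ends up doing the conversion, the step itself is small. A sketch in C, assuming the lexeme is a run of decimal digits that is not NUL-terminated; the function name lexeme_to_long and the 32-byte buffer are arbitrary choices for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Convert a decimal integer lexeme (not NUL-terminated) to a long.
 * Returns 0 on success, or -1 if the lexeme is empty, too long for the
 * local buffer, contains characters strtol does not consume, or is out
 * of range for long. */
int lexeme_to_long(const char *lexeme, size_t len, long *out)
{
    char buf[32];
    if (len == 0 || len >= sizeof buf)
        return -1;
    memcpy(buf, lexeme, len);
    buf[len] = '\0';  /* strtol needs a terminated string */

    char *end;
    errno = 0;
    long value = strtol(buf, &end, 10);
    if (errno == ERANGE || end != buf + len)
        return -1;

    *out = value;
    return 0;
}
```

The resulting value can then be stored in the token or in an AST node next to the other information the node needs, such as source position or literal type.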
Upvotes: 3