Reputation: 2508
I am trying to build my own compiler which outputs the type of input the user gives. For example, abcd is an identifier, and 1242 is an integer. I have implemented it as below:
testProg.l
%{
#define IDENTIFIER 10
#define INTEGER 11
%}
IDENTIFIER [a-zA-Z_][a-zA-Z0-9_]*
INTEGER [1-9][0-9]*|"0"
%%
{IDENTIFIER} { return IDENTIFIER; }
{INTEGER} { return INTEGER; }
%%
int main() {
    int token;
    /* yylex() returns 0 at end of input, which ends the loop */
    while ((token = yylex())) {
        if (token == IDENTIFIER) { printf("IDENTIFIER"); }
        else if (token == INTEGER) { printf("INTEGER"); }
        else { printf("INVALID"); }
    }
    return 0;
}
This works perfectly when I run the following commands:
flex testProg.l
cc lex.yy.c -lfl
./a.out
Sample working input
sample
IDENTIFIER
1993
INTEGER
The problem arises when I try to input an invalid token, for example 12abc. This is neither an integer nor an identifier and should output "INVALID" but it outputs:
12abc
INTEGER
IDENTIFIER
What happened is that 12 and abc are taken as separate tokens instead of one. How can I avoid this?
Upvotes: 0
Views: 256
Reputation: 241671
Many languages use lexical analysers which are perfectly happy to let 12abc be an integer followed by an identifier. Why not? If that means something in the language, then that's probably what the user meant. If it doesn't mean anything, it will trigger a syntax error, so the user will be informed.
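But, OK, you want to recognise that as an error. In that case you need to recognise the erroneous input as an error, and the first step is to recognise it as a token. That's easy if you remember flex's match precedences: the longest match wins, and among rules that match the same length, the one listed first wins. So a catch-all pattern placed last only wins when it matches more text than the valid patterns do: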
[[:alpha:]_][[:alnum:]_]* { return IDENTIFIER; }
[1-9][[:digit:]]*|0 { return NUMBER; }
[[:alnum:]_]+ { return BADTOKEN; }
Note that I replaced your macros with actual patterns, using named character classes for readability, and removed the redundant quotes on "0".
Upvotes: 2
Reputation: 4838
Flex parses 12abc as two separate tokens because you didn't tell it not to.
Lex derivatives, like Flex, work by one very simple but effective algorithm: they start at the position where the last token ended (or at the beginning of the text) and try to find the rule that matches the most characters from this point. (If multiple rules match the same number of characters, the one defined first in the "*.l" file is chosen.) That's it. Notice that nothing in this requires a match to cover a whole word.
That's actually a good thing. It is why in most programming languages you don't need to explicitly separate tokens. You can write things like (2+30L)/2 and the lexer for that language will figure out where each token ends, without additional hints like whitespace. (The tokens would be (, 2, +, 30, L, ), / and 2.)
If you want to disable this fancy mechanism for the specific case of putting numbers and identifiers together, you will need to create a rule that explicitly forbids it, for example (ERROR being a new token code that you define alongside IDENTIFIER and INTEGER):
{IDENTIFIER} { return IDENTIFIER; }
{INTEGER} { return INTEGER; }
[0-9A-Za-z_]+ { return ERROR; }
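Notice that this new rule also matches valid identifiers and integers. However, it won't be used for them: it matches exactly the same number of characters as the valid rule does, and it loses the tie because it is below them in the rules list. For 12abc it matches all five characters, which is longer than the two characters the integer rule can match, so it wins.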
Upvotes: 1