Geeky Quentin

Reputation: 2508

Can't identify the invalid identifier in my FLEX code

I am trying to build my own compiler which outputs the type of input the user gives, for example, abcd is an identifier, and 1242 is an integer. I have implemented it as below:

testProg.l

%{
    #define IDENTIFIER 10
    #define INTEGER    11
%}

IDENTIFIER    [a-zA-Z_][a-zA-Z0-9_]*
INTEGER       [1-9][0-9]*|"0"

%%

{IDENTIFIER} { return IDENTIFIER; }
{INTEGER}    { return INTEGER; }

%%

int main() {
    int token;
    
    while(token = yylex()) {
        if(token == IDENTIFIER) { printf("IDENTIFIER"); }
        else if(token == INTEGER) { printf("INTEGER"); }
        else { printf("INVALID"); }
    }
}

This works perfectly when I run the following commands:

flex testProg.l
cc lex.yy.c -lfl
./a.out

Sample working input

sample
IDENTIFIER
1993
INTEGER

The problem arises when I try to input an invalid token, for example 12abc. This is neither an integer nor an identifier and should output "INVALID" but it outputs:

12abc
INTEGER
IDENTIFIER

What happened is that 12 and abc are taken as separate tokens instead of one. How can I avoid this?

Upvotes: 0

Views: 256

Answers (2)

rici

Reputation: 241671

Many languages use lexical analysers which are perfectly happy to let 12abc be an integer followed by an identifier. Why not? If that means something in the language, then that's probably what the user meant. If it doesn't mean anything, it will trigger a syntax error, so the user will be informed.

But, OK, you want to recognise that as an error. In that case you need to recognise the erroneous input as an error, and the first step is to recognise it as a token. That's easy if you remember flex's match precedences (the longest match wins, and among matches of equal length the rule listed first wins):

[[:alpha:]_][[:alnum:]_]*   { return IDENTIFIER; }
[1-9][[:digit:]]*|0         { return NUMBER; }
[[:alnum:]_]+               { return BADTOKEN; }

Note that I replaced your macros with actual patterns, using named character classes for readability, and removed the redundant quotes on "0".

Upvotes: 2

Piotr Siupa

Reputation: 4838

Flex parses 12abc as two separate tokens because you didn't tell it not to.

Lex derivatives, like Flex, work by one very simple but effective algorithm: they start at the position where the last token ended (or at the beginning of the text) and try to find the rule that matches the most characters from that point. (If multiple rules match the same number of characters, the one defined first in the "*.l" file is chosen.) That's it. Notice there is nothing in it about having to match a whole word.

That's actually a good thing. It is why in most programming languages you don't need to explicitly separate tokens. You can write things like (2+30L)/2 and the lexer for that language will figure out where each token ends, without additional hints like whitespace. (The tokens would be (, 2, +, 30, L, ), / and 2.)

If you want to disable this fancy mechanism for the specific case of putting numbers and identifiers together, you will need to create a rule that explicitly forbids it, e.g.:

{IDENTIFIER}  { return IDENTIFIER; }
{INTEGER}     { return INTEGER; }
[0-9A-Za-z_]+ { return ERROR; }

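Notice that this new rule also matches valid identifiers and integers. However, it won't be used for them: for a valid token it matches exactly as many characters as the rule above it, and on a tie Flex picks the rule that appears first in the rules list.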

Upvotes: 1
