How to treat extended ASCII chars using lex?

Question

I am trying to handle extended ASCII characters with lex, e.g., àÀ.

%{
#include <stdio.h>
%}

DIGIT           [0-9]
ALPHA_CHAR      [A-Za-z]
EXTENDED        [àÀ]

CHAR             {ALPHA_CHAR}|{DIGIT}|{EXTENDED}
CHARS            ({CHAR})+

%%
{CHARS}    { printf("CHARS: %s\n", yytext); }
.          { printf("Unknown character: %s\n", yytext); }
%%

int main(int argc, char **argv) {
    yylex();
    return 0;
}

int yywrap() {
    return 1;
}

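When I give àÀ as input, my code prints Ã Ã€, because each of these characters is encoded in UTF-8 as two bytes (16 bits): à is 0xC3 0xA0, which shows up as Ã and NBSP when the bytes are read one at a time, and À is 0xC3 0x80, which shows up as Ã and €.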

What I could do is, whenever I detect 0xC3, expect a second byte and add an appropriate offset to that byte to get the extended ASCII equivalent.

The offset would be 0xC0 - 0x80 = 0x40, since À is 0xC0 in extended ASCII (Latin-1) and 0xC3 0x80 in UTF-8.
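To make the offset idea concrete, here is a minimal sketch in C. The helper name utf8_pair_to_latin1 is hypothetical and not part of lex, and the sketch assumes the input is valid UTF-8 and that only characters in the Latin-1 range (U+0080 to U+00FF) matter. It uses the standard two-byte UTF-8 decoding, which for a lead byte of 0xC3 works out to the second byte plus 0x40.

```c
/* Hypothetical helper, for illustration only: map a two-byte UTF-8
 * sequence to its Latin-1 code (0x80..0xFF), or return -1 if the pair
 * is not a two-byte sequence that Latin-1 can represent. */
int utf8_pair_to_latin1(unsigned char lead, unsigned char cont)
{
    /* The second byte must be a continuation byte of the form 10xxxxxx. */
    if ((cont & 0xC0) != 0x80)
        return -1;

    /* Only the lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF. */
    if (lead != 0xC2 && lead != 0xC3)
        return -1;

    /* Standard two-byte decoding: 5 payload bits from the lead byte,
     * 6 from the continuation byte. For lead == 0xC3 this is cont + 0x40. */
    return ((lead & 0x1F) << 6) | (cont & 0x3F);
}
```

With this sketch, utf8_pair_to_latin1(0xC3, 0x80) gives 0xC0 (À) and utf8_pair_to_latin1(0xC3, 0xA0) gives 0xE0 (à). It still has to be called from a lex action that has matched both bytes.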

But I find this idea kind of dirty.

Any better ideas to handle this issue?
