I am trying to handle extended ASCII characters with lex, e.g., àÀ.
%{
#include <stdio.h>
%}
DIGIT [0-9]
ALPHA_CHAR [A-Za-z]
EXTENDED [àÀ]
CHAR {ALPHA_CHAR}|{DIGIT}|{EXTENDED}
CHARS ({CHAR})+
%%
{CHARS} { printf("CHARS: %s\n", yytext); }
. { printf("Unknown character: %s\n", yytext); }
%%
int main(int argc, char **argv) {
    yylex();
    return 0;
}
int yywrap() {
    return 1;
}
When I give àÀ as input, my code prints à À, because à is encoded in UTF-8 as 16 bits (two bytes), 0xC3 (Ã) and 0xA0 (NBSP), and so is À: 0xC3 (Ã) and 0x80 (€).
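To make the byte layout concrete, here is a minimal C sketch that dumps the bytes of a string as hex. The helper name `hex_dump` is purely illustrative (it is not part of lex), and the input is written with explicit escapes so the example does not depend on the source file's encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of `s` into `out` (capacity `cap`)
 * as space-separated uppercase hex pairs. Returns the byte length of `s`. */
size_t hex_dump(const char *s, char *out, size_t cap)
{
    size_t n = strlen(s);
    size_t pos = 0;

    assert(out != NULL && cap > 0);
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        unsigned byte = (unsigned)(unsigned char)s[i];
        int w = snprintf(out + pos, cap - pos, "%s%02X", i ? " " : "", byte);
        if (w < 0 || (size_t)w >= cap - pos) {
            out[pos] = '\0';   /* out of room: drop the partial pair and stop */
            break;
        }
        pos += (size_t)w;
    }
    return n;
}
```

Calling `hex_dump("\xC3\xA0\xC3\x80", buf, sizeof buf)` on the UTF-8 bytes of àÀ should report four bytes, not two characters, which is what the scanner actually sees.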
What I could do is, when I detect 0xC3, expect a second byte and add an appropriate offset to it to get the extended ASCII (ISO-8859-1) equivalent.
The offset would be 0xC0 - 0x80 = 0x40, since À is 0xC0 in extended ASCII and 0xC3 0x80 in UTF-8.
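A minimal sketch of that offset idea in C, assuming the input is valid UTF-8 and only 0xC3-led two-byte sequences need converting. The function name `utf8_c3_to_latin1` is hypothetical, and it would have to run on the input before (or inside) the scanner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies `len` bytes from `in` to `out`, collapsing each
 * two-byte sequence 0xC3 0x80..0xBF into one ISO-8859-1 byte by adding 0x40
 * to the second byte. All other bytes are copied unchanged.
 * `out` must have room for at least `len` bytes. Returns the bytes written. */
size_t utf8_c3_to_latin1(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        if (in[i] == 0xC3 && i + 1 < len &&
            in[i + 1] >= 0x80 && in[i + 1] <= 0xBF) {
            out[n++] = (unsigned char)(in[i + 1] + 0x40); /* 0x80 -> 0xC0 (À) */
            i++;                                          /* skip continuation byte */
        } else {
            out[n++] = in[i];
        }
    }
    return n;
}
```

With this, the bytes 0xC3 0xA0 0xC3 0x80 become 0xE0 0xC0. The sketch covers only U+00C0 to U+00FF; characters U+0080 to U+00BF are encoded with a 0xC2 lead byte and would need their own case.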
But I find this idea kind of dirty.
Any better ideas to handle this issue?