Making lex to read UTF-8 doesn't work

Question

I wrote an XML parser that parses ASCII files, but I now need it to read UTF-8 encoded files. I have the following regexes in lex, but they don't match UTF-8. I am not sure what I am doing wrong:

utf_8       [\x00-\xff]*
bom         [\xEF\xBB\xBF]

then:

bom             { fprintf( stderr, "OMG I SAW A BOM"); return BOM;}
utf_8           { fprintf( stderr, "OMG I SAW A UTF CHAR", yytext[0] ); return UTF_8;}

I also have the following grammar rules:

program 
: UTF8 '<' '?'ID attribute_list '?''>' 
root ...

where UTF8 is:

UTF8

: BOM           {printf("i saw a bom\n");}
| UTF_8         {printf("i saw a utf\n");}
|               {printf("i didn't see anything.'\n");} 
;
;

It always prints "i didn't see anything". My parser works for ASCII files, that is, when I copy and paste the contents of the UTF-8 XML file into an empty document.

Any help would be appreciated.

EDIT:

Here is a trimmed .l file for reference:

%{
#include 
#include 
#include 
#include "y.tab.h"
int lines = 1;
%}

utf_8       [\x0000-\xffff]*
bom         [\xEF\xBB\xBF]
whitespace  [ \t]
ev          (.|{bom})
ev1         (.|{utf_8})
%%
{whitespace}    { fprintf( stderr, "%s", yytext );}

\n              { fprintf( stderr, "%s%d ", yytext, lines++ );}
.               { fprintf( stderr, "{TOKEN:%c}", yytext[0] ); return yytext[0];}
bom             { fprintf( stderr, "OMG I SAW A BOM"); return BOM;}
utf_8           { fprintf( stderr, "OMG I SAW A UTF CHAR", yytext[0] ); return UTF_8;}
%%

void error( char *message )
{
    fprintf( stderr, "Error: %s\n", message );
    exit(1);
}

zwol · Accepted Answer

Okay, this is your problem:

utf_8       [\x0000-\xffff]*
bom         [\xEF\xBB\xBF]

There are two problems here. First, Flex doesn't actually understand Unicode. It works on bytes. So you need a regex macro that matches any valid UTF-8 byte sequence. http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex has a regular expression that does that, which is not terribly hard to convert to Flex syntax (see below). Second, the square brackets in your BOM macro make it match any single byte with value EF, BB, or BF, not the three-byte sequence EF BB BF, which is what you want.

(Incidentally, UTF-8 files are not supposed to have byte order marks, although many do anyway.)
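If you would rather strip a BOM before the lexer ever sees it, a check on the first three bytes is enough. This is only a minimal sketch: `utf8_bom_length` is a hypothetical helper name, not part of Flex, and it assumes you have the start of the input in a byte buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many leading bytes to skip.
   That is 3 if the buffer starts with the UTF-8 BOM (EF BB BF), else 0. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    /* All three bytes must be present, in this exact order. */
    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        return 3;
    return 0;
}
```

Note that this compares the whole three-byte sequence, which is what the bracketed character class in the question fails to do.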

Here is a complete Flex input file that does more or less what you appear to have been trying to do:

%{
#include <stdio.h>
%}

bom     \xEF\xBB\xBF
white   [ \t]

u2a     [\xC2-\xDF][\x80-\xBF]
u2b     \xE0[\xA0-\xBF][\x80-\xBF]
u3a     [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
u3b     \xED[\x80-\x9F][\x80-\xBF]
u4a     \xF0[\x90-\xBF][\x80-\xBF]{2}
u4b     [\xF1-\xF3][\x80-\xBF]{3}
u4c     \xF4[\x80-\x8F][\x80-\xBF]{2}

utf_8   {u2a}|{u2b}|{u3a}|{u3b}|{u4a}|{u4b}|{u4c}

%%

{white}     { putchar(' ');  }

\n          { putchar('\n'); }
{bom}       { putchar('B');  }
{utf_8}     { putchar('u');  }
[\x21-\x7e] { putchar('.');  }
.           { putchar('^');  }
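
To see what the u2a through u4c macros are checking, here is roughly the same logic written by hand in C. Treat it as an illustrative sketch, not something Flex generates: `utf8_sequence_length` and `in_range` are hypothetical names, and unlike the `utf_8` macro this version also accepts a plain ASCII byte as a one-byte sequence.

```c
#include <stddef.h>

/* Returns nonzero if b lies in the inclusive range [lo, hi]. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical helper mirroring the u2a..u4c macros. Returns the length
   in bytes (1 to 4) of the well-formed UTF-8 sequence starting at s, or 0
   if the bytes are not well-formed or the buffer is too short. */
size_t utf8_sequence_length(const unsigned char *s, size_t len)
{
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for the 2nd byte */
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;                                      /* ASCII */

    if (in_range(s[0], 0xC2, 0xDF))      { need = 2; }             /* u2a */
    else if (s[0] == 0xE0)               { need = 3; lo = 0xA0; }  /* u2b */
    else if (in_range(s[0], 0xE1, 0xEC)
             || s[0] == 0xEE
             || s[0] == 0xEF)            { need = 3; }             /* u3a */
    else if (s[0] == 0xED)               { need = 3; hi = 0x9F; }  /* u3b: no surrogates */
    else if (s[0] == 0xF0)               { need = 4; lo = 0x90; }  /* u4a */
    else if (in_range(s[0], 0xF1, 0xF3)) { need = 4; }             /* u4b */
    else if (s[0] == 0xF4)               { need = 4; hi = 0x8F; }  /* u4c: up to U+10FFFF */
    else
        return 0;   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

    if (len < need)
        return 0;
    if (!in_range(s[1], lo, hi))          /* 2nd byte has the tightened range */
        return 0;
    for (i = 2; i < need; i++)            /* remaining bytes are 0x80..0xBF */
        if (!in_range(s[i], 0x80, 0xBF))
            return 0;
    return need;
}
```

The narrowed second-byte ranges are the reason the Flex macros are split into so many cases: they reject overlong encodings, UTF-16 surrogates, and code points above U+10FFFF.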
