Reputation: 29892
It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had hit EOF.
Is there a way to force flex to eat my UTF-8 chars? I don't want it to actually match UTF-8 chars, just eat them when using the '.' pattern.
Any suggestion?
EDIT
The simplest solution would be:
ANY [\x00-\xff]
and use '{ANY}' instead of '.' in my rules.
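For completeness, here is a minimal sketch of how that definition could be wired into a full scanner. The "hello" keyword rule and the printf action are purely illustrative, not part of the original suggestion:

%{
#include <stdio.h>
%}
%option noyywrap
ANY [\x00-\xff]
%%
"hello"   { printf("keyword\n"); }
{ANY}     { /* consume any other byte, UTF-8 continuation bytes included */ }
%%
int main(void) { return yylex(); }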
Upvotes: 4
Views: 8055
Reputation:
Writing a negated character class might also help:

[\n \t]   return WHITESPACE;
[^\n \t]  return NON_WHITESPACE;
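A self-contained sketch of that idea (the printf actions stand in for returning real tokens, which would normally be defined in a parser header; this is an assumption for illustration). Because the negated class matches any byte outside the listed set, it happily consumes the bytes of a multi-byte UTF-8 sequence as well:

%option noyywrap
%%
[\n \t]+    { printf("WHITESPACE\n"); }
[^\n \t]+   { printf("NON_WHITESPACE: %s\n", yytext); }
%%
int main(void) { return yylex(); }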
Upvotes: 1
Reputation: 28384
I have been looking into this myself and reading the Flex mailing list to see if anyone has thought about it. Getting Flex to read Unicode natively is a complex affair ...
UTF-8 encoding can be handled, but most other encodings (the 16-bit ones such as UTF-16) would lead to massive tables driving the automata.
A common method so far is:
What I did was simply write patterns that match single UTF-8 characters. They look something like the following, but you might want to re-read the UTF-8 specification because I wrote this so long ago.
You will of course need to combine these since you want Unicode strings, not just single characters.
UB [\200-\277]
%%
[\300-\337]{UB}      { do something }
[\340-\357]{UB}{2}   { do something }
[\360-\367]{UB}{3}   { do something }
[\370-\373]{UB}{4}   { do something }
[\374-\375]{UB}{5}   { do something }
Taken from the mailing list.
I may look at creating a proper patch for UTF-8 support after investigating further. The above solution seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a '.' substitute rule that matches any ASCII or UTF-8 character, though that is still rather ugly; a sketch of the idea follows.
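For illustration, such a '.' substitute might look like the following (a sketch only, with an assumed name UTF8CHAR; it covers sequences up to four bytes, since the five- and six-byte forms were removed from UTF-8 by RFC 3629, and unlike '.' it also matches newline):

%option noyywrap
UB [\200-\277]
UTF8CHAR [\0-\177]|[\300-\337]{UB}|[\340-\357]{UB}{2}|[\360-\367]{UB}{3}
%%
{UTF8CHAR}   { /* each match is one ASCII byte or one whole multi-byte sequence */ }
%%
int main(void) { return yylex(); }

Flex expands a {name} reference to the parenthesized definition, so the alternation inside UTF8CHAR stays grouped correctly when used in a rule.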
Hope this helps!
Upvotes: 7