Reputation: 1803
I've written a Lexer in C, it currently lexes files in ASCII successfully, however I'm confused as to how I would lex unicode. What unicode would I need to lex, for instance should I support utf-8, utf-16, etc. What do languages like Rust or Go support?
If so are there any libraries that can help me out, although I would prefer to try and do it myself so I can learn. Even then, a small library that I could read to learn from would be great.
Upvotes: 2
Views: 1579
Reputation: 5893
There are already version of lex (and other lexer tools that support UniCode) and they are tabulated on the WikiPedia Page: List of Lexer Generators. There is also a list of lexer tools on the Wikipedia Parser Page. In summary, the following tools handle UniCode:
And, of course, there are the techniques used by W3.org and cited by @jim mcnamara at http://www.w3.org/2005/03/23-lex-U.
You say you have written your own lexer in C, but you have used the tag lex
for the tool called lex
; perhaps that was an oversight?
In the comments you say you have not used regular expressions, but also want to learn. Learning something about the theory of language recognition is key to writing an efficient and working lexer. The symbols being recognised are classified as a Chomsky Type 3 Language, or a Regular Language, which can be described by Regular Expressions. Regular Expressions can be implemented by coding that implements a Finite State Automata (or Finite State Machine). The standard implementation for a finite state machine is coded by a loop containing a switch. Most experienced coders should know, and be able to recognise and exploit this form:
while ( not <<EOF>> ) {
switch ( input_symbol ) {
case ( state_symbol[0] ) :
...
case ( state_symbol[1] ) :
...
default:
....
}
}
If you had coded in this style, the same coding could simply work whether the symbols being handled were 8 bit or 16 bit, as the algorithmic coding pattern remains the same.
Ad-Hoc coding of a lexical analyser without an understanding of the underlying theory and practice will eventually have its limits. I think you will find it beneficial to read a little more into this area.
Upvotes: 3