metro-man

Reputation: 1803

How do I lex unicode characters in C?

I've written a lexer in C. It currently lexes ASCII files successfully, but I'm confused about how I would lex Unicode. Which Unicode encodings would I need to lex — for instance, should I support UTF-8, UTF-16, etc.? What do languages like Rust or Go support?

Are there any libraries that can help me out? I would prefer to try to do it myself so I can learn, but even then, a small library that I could read and learn from would be great.

Upvotes: 2

Views: 1579

Answers (1)

There are already versions of lex (and other lexer tools) that support Unicode; they are tabulated on the Wikipedia page List of Lexer Generators. There is also a list of lexer tools on the Wikipedia parser page. In summary, the following tools handle Unicode:

  • JavaCC - Generates lexical analyzers written in Java.
  • JFlex - A lexical analyzer generator for Java.
  • Quex - A fast universal lexical analyzer generator for C and C++.
  • FsLex - A lexer generator for byte and Unicode character input for F#.

And, of course, there are the techniques used by W3.org and cited by @jim mcnamara at http://www.w3.org/2005/03/23-lex-U.

You say you have written your own lexer in C, but you have used the tag lex, which refers to the tool called lex; perhaps that was an oversight?

In the comments you say you have not used regular expressions, but also want to learn. Learning some of the theory of language recognition is key to writing an efficient and working lexer. The symbols being recognised form a Chomsky Type 3 language, or regular language, which can be described by regular expressions. Regular expressions can in turn be implemented by code that realises a finite state automaton (or finite state machine). The standard implementation of a finite state machine is a loop containing a switch. Most experienced coders should know, and be able to recognise and exploit, this form:

while ( not <<EOF>> ) {
    switch ( input_symbol ) {
        case ( state_symbol[0] ):
            ...
        case ( state_symbol[1] ):
            ...
        default:
            ...
    }
}

If you had coded in this style, the same code could work largely unchanged whether the symbols being handled were 8-bit bytes or wider Unicode code points, as the algorithmic pattern remains the same.

Ad-hoc coding of a lexical analyser without an understanding of the underlying theory and practice will eventually hit its limits. I think you will find it beneficial to read a little more in this area.

Upvotes: 3
