ELLIOTTCABLE

Reputation: 18108

Where can I get started with Unicode-friendly programming in C?

So, I’m working on a plain-C (C99, i.e. ISO/IEC 9899:1999) project, and am trying to figure out where to get started re: Unicode, UTF-8, and all that jazz.

Specifically, it’s a language interpreter project, and I have two primary places where I’ll need to handle Unicode: reading in source files (the language ostensibly supports Unicode identifiers and such), and in ‘string’ objects.

I’m familiar with all the obvious basics about Unicode, UTF-7/8/16/32 & UCS-2/4, so on and so forth… I’m mostly looking for useful, C-specific (that is, please no C++ or C#, which is all that’s been documented here on SO previously) resources as to my ‘next steps’ to implement Unicode-friendly stuff… in C.

Any links, manpages, Wikipedia articles, or example code are all extremely welcome. I’ll also try to maintain a list of such resources here in the original question, for anybody who happens across it later.


Upvotes: 8

Views: 959

Answers (3)

Geoff Reedy

Reputation: 36071

GLib has some Unicode functions and is a pretty lightweight library. It's nowhere near the level of functionality that ICU provides, but it might be good enough for some applications. The other features of GLib are good to have for portable C programs too.

GTK+ is built on top of GLib. GLib provides the fundamental algorithmic language constructs commonly duplicated in applications. This library has features such as (this list is not a comprehensive list):

  • Object and type system
  • Main loop
  • Dynamic loading of modules (i.e. plug-ins)
  • Thread support
  • Timer support
  • Memory allocator
  • Threaded Queues (synchronous and asynchronous)
  • Lists (singly linked, doubly linked, double ended)
  • Hash tables
  • Arrays
  • Trees (N-ary and binary balanced)
  • String utilities and charset handling
  • Lexical scanner and XML parser
  • Base64 (encoding & decoding)

Upvotes: 3

pm100

Reputation: 50210

I think one of the interesting questions is: what should your canonical internal format for strings be? The two obvious choices (to me at least) are:

a) UTF-8 in vanilla C strings
b) UTF-16 in unsigned short arrays

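In previous projects I have always chosen UTF-8. Why? Because it's the path of least resistance in the C world. Everything byte-oriented that you interface with (stdio, string.h, etc.) will work fine, as long as you remember that lengths and offsets are counted in bytes, not characters.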
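To make the bytes-versus-characters point concrete, here is a minimal sketch. The function name utf8_codepoints is made up for illustration, and it assumes the input is already valid UTF-8 (it does no validation):

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * Hypothetical helper: assumes the input is already valid UTF-8.
 * Every code point has exactly one byte that is NOT a continuation
 * byte (continuation bytes look like 10xxxxxx), so we count those. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "na" "\xC3\xAF" "ve" (the word naïve, with ï encoded as two bytes), strlen reports 6 while utf8_codepoints reports 5. Note that this counts code points, not user-perceived characters; combining sequences would still need something like ICU.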

Next comes the question of file format. The problem here is that it's visible to your users (unless you provide the only editor for your language). Here I guess you have to take what they give you and try to guess the encoding by peeking at the first few bytes (byte order marks help).
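A rough sketch of the peeking step, assuming you have already read the first few bytes of the file into a buffer. The enum and the detect_bom function are names I made up for this example, not part of any library:

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM: caller falls back to a default, e.g. UTF-8 */
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE
} encoding;

/* Inspect the first bytes of a file for a byte order mark.
 * Stores the BOM length in *bom_len so the caller can skip it. */
encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;

    /* UTF-32 must be checked before UTF-16, because the UTF-32LE BOM
     * (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32_LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32_BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16_LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16_BE;
    }
    return ENC_UNKNOWN;
}
```

This is only a heuristic: plenty of UTF-8 files carry no BOM at all, and FF FE 00 00 could in principle be a UTF-16LE file that starts with U+0000. Treating "no BOM" as UTF-8 is a design choice you would have to make yourself.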

Upvotes: 0

Geoff Reedy

Reputation: 36071

International Components for Unicode provides a portable C library for handling Unicode. Here's their elevator pitch for ICU4C:

The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).

Upvotes: 10
