Reputation: 11966
I have a basic understanding of UTF-8: code points are encoded with a variable number of bytes, so a "character" can take 8 bits, 16 bits, or even more.
What I'm wondering is whether there is some sample code, library, etc. in C that does for a UTF-8 string what the C standard library does for ordinary strings, e.g. tell the length of the string.
Thanks,
Upvotes: 8
Views: 10406
Reputation: 7661
If you are interested in a library that doesn't allocate memory and uses the stack, you could try utf8rewind.
Upvotes: 1
Reputation: 80384
GNU does have a Unicode string library, called libunistring, but it doesn’t handle things nearly as well as ICU does.
For example, the GNU library doesn’t even give you access to collation, which is the basis for all string comparison. In contrast, ICU does. Another thing that ICU has and that GNU doesn’t appear to have is Unicode regexes. For that, you might like to use Philip Hazel’s excellent PCRE library for C, which can be compiled with UTF-8 support.
However, it might be that the GNU library is enough for what you need. I don’t like its API much. Very messy. If you like C programming, you might try the Go programming language, which has excellent Unicode support. It’s a new language, but small and clean and fun to use.
On the other hand, the major interpreted languages (Perl, Python, and Ruby) all have varying support for Unicode that is better than you’ll ever get in C. Of those, Perl’s Unicode support is the most developed and robust.
Remember: it isn’t enough to support more characters. Without the rules that go with them, you don’t have Unicode. At most, you might have ISO 10646: a large character repertoire but no rules. My mantra is “Unicode isn’t just more characters; it’s more characters plus a whole bunch of rules for handling them.”
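To make that concrete, here is a minimal sketch of the "length" function the question asks about, written in plain C with no library. The name `utf8_codepoints` is my own invention and not part of libunistring, ICU, or any other library mentioned here, and the code assumes the input is already valid, NUL-terminated UTF-8 (it does no validation at all).

```c
#include <stddef.h>

/* Hypothetical helper, not taken from any library.
 * Counts the code points in a NUL-terminated string that is ASSUMED
 * to be valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx,
 * so every byte that does NOT match that pattern starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For `"e\xCC\x81"` (the letter e followed by U+0301 COMBINING ACUTE ACCENT), `strlen` gives 3 bytes and this function gives 2 code points, yet a reader sees a single character. The precomposed form `"\xC3\xA9"` gives 1 code point for what looks like the same text. Counting code points is the easy part; deciding that those two strings are "the same" or "one character long" takes the rules (normalization, grapheme segmentation), and that is where a library like ICU earns its keep.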
Upvotes: 4