Reputation: 4877
I'm writing an HTML parser in C, and am looking to correctly follow the W3C guidelines on parser implementation. One of the key points is that the parser operates on a stream of Unicode Code Points rather than bytes, which makes sense.
Basically, then, given a buffer of known character encoding (I will either be given an explicit input encoding, or will use the HTML5 prescan algorithm to make a good guess), what's the best way in C — ideally cross-platform, but sticking to UNIX is fine — to iterate over an equivalent sequence of Unicode Code Points?
Is alloc'ing a few reasonably-sized buffers and using iconv the way to go? Should I be looking at ICU? The macros like U16_NEXT seem to be well-suited to my task, but the ICU documentation is incredibly long-winded, and it's a little hard to see exactly how to glue things together.
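For the iconv route mentioned above, a minimal sketch might look like the following. It assumes a POSIX iconv that accepts the target name "UCS-4LE" (fixed-width, 32-bit, little-endian, so every 4 output bytes form one code point). decode_code_points is a hypothetical helper name, not part of any library. A real streaming parser would also need to carry an incomplete trailing sequence (EINVAL) over to the next input buffer instead of treating it as an error, as this sketch does.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes `len` bytes of `src` (in `encoding`) into
   `out`, writing at most `out_cap` code points. Returns the number of code
   points written, or (size_t)-1 on a conversion error. */
size_t decode_code_points(const char *encoding, const char *src, size_t len,
                          uint32_t *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UCS-4LE", encoding);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)src;      /* iconv() does not modify the input bytes */
    size_t in_left = len;
    size_t count = 0;
    int failed = 0;

    while (in_left > 0 && count < out_cap && !failed) {
        unsigned char chunk[4 * 64];
        char *dst = (char *)chunk;
        /* Never ask for more output than the caller has room for. */
        size_t room = out_cap - count;
        size_t dst_size = room < 64 ? room * 4 : sizeof chunk;
        size_t dst_left = dst_size;

        size_t rc = iconv(cd, &in, &in_left, &dst, &dst_left);
        /* E2BIG only means "chunk full": drain it and go around again. */
        if (rc == (size_t)-1 && errno != E2BIG)
            failed = 1;

        size_t produced = dst_size - dst_left;
        for (size_t i = 0; i + 4 <= produced; i += 4) {
            /* Assemble little-endian bytes, independent of host byte order. */
            out[count++] = (uint32_t)chunk[i]
                         | (uint32_t)chunk[i + 1] << 8
                         | (uint32_t)chunk[i + 2] << 16
                         | (uint32_t)chunk[i + 3] << 24;
        }
    }

    iconv_close(cd);
    return failed ? (size_t)-1 : count;
}
```

The parser proper would then loop over the uint32_t array (or consume each chunk as it is produced) without caring what the input encoding was.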
Upvotes: 6
Views: 416
Reputation: 183
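The following decodes one code point from a UTF-16 string and returns the number of code units consumed, i.e. how much to advance the string by. Note that xs_utf16 is an unsigned short, and that the function is C++ rather than C, since it takes code by reference. More info: http://sree.kotay.com/2006/12/unicode-is-pain-in.html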
enum
{
xs_UTF_Max = 0x0010FFFFUL,
xs_UTF_Replace = 0x0000FFFDUL,
xs_UTF16_HalfBase = 0x00010000UL,
xs_UTF16_HighStart = 0x0000D800UL,
xs_UTF16_HighEnd = 0x0000DBFFUL,
xs_UTF16_LowStart = 0x0000DC00UL,
xs_UTF16_LowEnd = 0x0000DFFFUL,
xs_UTF16_MaxUCS2 = 0x0000FFFFUL,
xs_UTF16_HalfMask = 0x000003FFUL,
xs_UTF16_HalfShift = 10
};
int32 xs_UTF16Decode (uint32 &code, const xs_utf16* str, int32 len, bool strict)
{
if (str==0||len==0) {code=0; return 0;}
uint32 c1 = str[0];
//note: many implementations test only from HighStart to HighEnd here,
// but a lone low surrogate is a partial code point too, so the
// trivial check should exclude the WHOLE surrogate range
if (c1<xs_UTF16_HighStart || c1>xs_UTF16_LowEnd) {code=c1; return 1;} //not a surrogate: the unit is the code point
//really an error if we're starting in the low range
//surrogate pair
if (len<=1 || str[1]==0) {code=xs_UTF_Replace; return strict ? 0 : 1;} //error
uint32 c2 = str[1];
code = ((c1-xs_UTF16_HighStart)<<xs_UTF16_HalfShift) + (c2-xs_UTF16_LowStart) + xs_UTF16_HalfBase;
if (strict==false) return 2;
//check for errors
if (c1>=xs_UTF16_LowStart && c1<=xs_UTF16_LowEnd) {code=xs_UTF_Replace; return 0;} //error
if (c2<xs_UTF16_LowStart || c2>xs_UTF16_LowEnd) {code=xs_UTF_Replace; return 0;} //error
if (code>xs_UTF_Max) {code=xs_UTF_Replace; return 0;} //error
//success
return 2;
}
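Since the question asks for C and the function above relies on a C++ reference parameter, here is a rough plain-C sketch of the same idea. The name utf16_next and the policy of substituting U+FFFD and consuming one unit for any unpaired surrogate are my own assumptions, not part of the code above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the decoder above. Decodes one code point
   from `len` UTF-16 code units at `s`, stores it in `*cp`, and returns the
   number of code units consumed (0 only when `len` is 0). */
size_t utf16_next(const uint16_t *s, size_t len, uint32_t *cp)
{
    if (len == 0) {
        *cp = 0;
        return 0;
    }

    uint32_t c1 = s[0];

    /* Anything outside the whole surrogate range is a code point as-is. */
    if (c1 < 0xD800 || c1 > 0xDFFF) {
        *cp = c1;
        return 1;
    }

    /* A high surrogate must be followed by a low surrogate. */
    if (c1 <= 0xDBFF && len > 1) {
        uint32_t c2 = s[1];
        if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
            *cp = 0x10000 + ((c1 - 0xD800) << 10) + (c2 - 0xDC00);
            return 2;
        }
    }

    /* Lone or misordered surrogate: emit U+FFFD and skip one unit. */
    *cp = 0xFFFD;
    return 1;
}
```

Calling it in a loop and advancing by the return value walks the buffer one code point at a time.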
Upvotes: 2
Reputation: 20140
Two things may be of interest to you:
Upvotes: 0
Reputation: 22337
ICU is a good choice. I used it with C++ and liked it a lot. ICU4C also ships a C API (the ucnv_* converter functions, plus macros such as U16_NEXT), so you get similarly well-thought-out APIs in C.
Not quite the same, but somewhat related: this tutorial explains how to perform streaming/incremental transliteration (the difficulty there is that the "cursor" may sometimes sit in the middle of a code point).
Upvotes: 2