What should I consider when converting between Unicode code points and UTF-8/16/32 or other encodings?

Question

UTF encodings have non-character codes, and I need to handle these exceptions. I know there are plenty of libraries that do this, but I think I need to know the fundamental principles.

What should I take care of when transcoding Unicode code points to or from UTF or UCS encodings? I think each encoding has different rules, but there should be some simple principles. I want to know them.

Update

I posted this question because I was trying to extract a Unicode code point (not a UTF-16 code unit) from an NSString. NSString's character-handling API only works in terms of UTF-16 code units, so I need extra processing to get the actual code point (which is what is actually meaningful). My program should either

handle surrogate-pair characters correctly,
or prohibit them for reliable character handling.

But the problem is that I am not sure surrogate pairs are the only thing to take care of in UTF-16. I think there may be more, and I want to know what. If possible, I would like to know the same for other encodings too, but only if it is simple enough to handle. If it is incredibly complex, I will just use a library like libICU.

I know libICU will give me these features, but for now it feels like over-engineering to me. If I knew the basic rules (for example, "surrogate pairs are the only thing to care about!"), then at least prohibiting unsupported characters should be easy and simple.

Jonathan Caryl · Accepted Answer

There's a method on NSString

enumerateSubstringsInRange:options:usingBlock:

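where you can specify NSStringEnumerationByComposedCharacterSequences as the options: and this will give you a series of NSRange values, each covering one composed character sequence. For most characters, which fit into a single unichar (i.e. 16 bits), the NSRange will cover a single index into the NSString, but for e.g. Emoji code points outside the Basic Multilingual Plane the NSRange will cover multiple unichars (a surrogate pair). Note that a composed character sequence can also contain more than one code point, for example a base letter followed by a combining mark, so a multi-unichar range is not always a single code point.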
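
If you want to see the underlying rule instead of relying on the enumeration, here is a minimal sketch in plain C (which also compiles as Objective-C) of how one code point can be read from UTF-16 code units. At the code-unit level, surrogate pairing is the only well-formedness rule in UTF-16: a high surrogate (0xD800 to 0xDBFF) must be followed directly by a low surrogate (0xDC00 to 0xDFFF), and anything else in the surrogate range is an error. The function name decode_utf16 and its return convention are my own assumptions for illustration, not a Foundation API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Foundation): decode the code point that
   starts at units[i]. A unichar is a 16-bit unsigned integer, so uint16_t
   is used here. Returns the number of code units consumed (1 or 2), or 0
   if units[i] is an unpaired surrogate. */
static size_t decode_utf16(const uint16_t *units, size_t len, size_t i,
                           uint32_t *cp)
{
    assert(units != NULL && cp != NULL && i < len);

    uint16_t u = units[i];

    /* Outside the surrogate range: the code unit is the code point. */
    if (u < 0xD800 || u > 0xDFFF) {
        *cp = u;
        return 1;
    }

    /* High surrogate: valid only if a low surrogate follows directly. */
    if (u <= 0xDBFF && i + 1 < len) {
        uint16_t lo = units[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                             | (uint32_t)(lo - 0xDC00));
            return 2;
        }
    }

    /* Lone low surrogate, or high surrogate without a partner. */
    return 0;
}
```

The code units could come from a buffer filled by getCharacters:range:. Treating a return value of 0 as "reject this string" gives you the "prohibit unsupported characters" option from the question, while looping and advancing by the returned count gives you the "handle surrogate pairs" option.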
