Reputation: 86135
UTF encodings have non-character codes, and I need to handle these exceptions. I know there are plenty of libraries that do this, but I think I need to know the fundamental principles.
What should I take care of when transcoding a Unicode code point to or from the UTF or UCS encodings? I think each encoding has different rules, but there should be some simple principles. I want to know them.
Update
I posted this question because I was trying to extract a Unicode code point (not a UTF-16 code unit) from NSString. NSString only offers a UTF-16 based API for character handling, so I need to perform extra processing to get the actual code point (which is what is actually meaningful). My program should
But the problem is that I am not sure surrogate pairs are the only thing to take care of in UTF-16. I think there should be more things to take care of, and I want to know what they are. If possible, I want to know this for the other encodings too. Of course, only if it's simple enough to handle. If it's incredibly complex, I will just use a library like libICU.
I know libICU will give me those features, but currently it feels like over-engineering to me. If I know the basic rules (for example, "surrogate pairs are the only thing to take care of!"), at least prohibiting unsupported characters should be very easy and simple.
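To make the "prohibit unsupported characters" idea concrete, here is a minimal sketch in Python of the encoding-independent checks on a single code point. The function names are hypothetical. The ranges are the ones Unicode defines: code points run from U+0000 to U+10FFFF, U+D800..U+DFFF are reserved for surrogates, and the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each plane. Note that noncharacters are still valid scalar values, so rejecting them is an application policy rather than a requirement of the encodings.

```python
MAX_CODE_POINT = 0x10FFFF


def is_surrogate(cp: int) -> bool:
    # U+D800..U+DFFF only exist as UTF-16 code units, never as characters.
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp: int) -> bool:
    # A scalar value is any code point that is not a surrogate.
    # Only scalar values can appear in well-formed UTF-8/16/32.
    return 0 <= cp <= MAX_CODE_POINT and not is_surrogate(cp)


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n (0..16).
    if not 0 <= cp <= MAX_CODE_POINT:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

A strict filter under these assumptions would accept a code point only when `is_scalar_value(cp) and not is_noncharacter(cp)`.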
Upvotes: 0
Views: 141
Reputation: 1330
There's a method on NSString, enumerateSubstringsInRange:options:usingBlock:, where you can specify NSStringEnumerationByComposedCharacterSequences as the options: and this will give you a series of NSRange values, each covering one composed character sequence (a user-perceived character, which may be made of more than one code point). So for most characters, which fit into a single unichar (i.e. 16 bits), the NSRange will cover a single index into the NSString, but for code points outside the Basic Multilingual Plane, such as emoji, the NSRange will cover multiple unichars.
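If you do want to walk the UTF-16 code units yourself, the surrogate-pair rule can be sketched as below. This is an illustrative Python version, not NSString code: the function name is hypothetical, and the input is assumed to be a sequence of 16-bit integers such as the values characterAtIndex: would return. It combines a high surrogate (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF) into one code point, and treats any unpaired surrogate as an error.

```python
from typing import Iterable, Iterator


def code_points_from_utf16(units: Iterable[int]) -> Iterator[int]:
    """Yield Unicode code points decoded from 16-bit UTF-16 code units."""
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            low = next(it, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate 0x{unit:04X}")
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate 0x{unit:04X}")
        else:
            # Any other unit is a BMP code point on its own.
            yield unit
```

For example, the units 0x0041, 0xD83D, 0xDE00 decode to the two code points U+0041 and U+1F600. Whether to raise on an unpaired surrogate or substitute U+FFFD is a design choice; raising is used here only to keep the sketch short.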
Upvotes: 1