ndquangr

Reputation: 455

Showing wrong character for a Unicode value in iOS

I am working on an iOS app that handles Unicode characters, but there seems to be a problem with translating a Unicode hex value (and the int value too) into a character.

For example, I want to get the character 'đ', which has the Unicode value c491, but after this code:

NSString *str = [NSString stringWithUTF8String:"\uc491"];

The value of str is not 'đ' but '쓉' (a Korean character) instead.

I also tried:

int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];

But both pieces of code give the same result.

I can't understand what the problem is here. Please help!

Upvotes: 1

Views: 1636

Answers (1)

nhahtdh

Reputation: 56829

The short answer

You can specify đ in any of the following ways (untested):

@"đ"
@"\u0111"
@"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]

Note that the last two lines use a C string literal instead of the Objective-C string object literal construct @"...".

As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.

The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
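To make the relationship between the code point and the bytes c4 91 concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). The helper name utf8_encode is hypothetical and not part of Foundation; it only illustrates the standard UTF-8 bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Foundation API: encodes one Unicode code point
   as UTF-8 into out and returns the number of bytes written, or 0 if the
   value is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {    /* surrogates are not characters */
        return 0;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, utf8_encode(0x0111, buf) writes the two bytes c4 91, whereas utf8_encode(0xC491, buf) writes the three bytes ec 92 91. That is consistent with what the question observed: "\uc491" is read as the code point U+C491 (쓉), not as the UTF-8 bytes of đ.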

The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond the Basic Multilingual Plane (Plane 0) in Unicode.

Unicode escape sequences (Universal character names in C99)

According to this blog [1]:

Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.

Page 65 of the C99 TC2 draft shows that the escape is written \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn is a "short-identifier as defined by ISO/IEC 10646 standard"; roughly speaking, that is the hexadecimal code point. Note that:

A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
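The restriction quoted above can be written as a small predicate. This is only a sketch of the quoted sentence in plain C; the function name ucn_allowed_c99 is hypothetical, and the check covers nothing beyond the ranges named in the quote.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the code point may be written as a
   \u or \U universal character name under the C99 rule quoted above. */
bool ucn_allowed_c99(uint32_t cp)
{
    /* D800 through DFFF inclusive is never allowed. */
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return false;
    }
    /* Below 00A0, only $ (0024), @ (0040) and ` (0060) are allowed. */
    if (cp < 0x00A0) {
        return cp == 0x0024 || cp == 0x0040 || cp == 0x0060;
    }
    return true;
}
```

Under this rule \u0111 is allowed, while something like \u0041 for the letter A is not, so plain ASCII characters should simply be typed directly.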

Character set vs. Character encoding

It seems that you are confusing the code point U+0111 with the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly.

A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]

A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
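Going in the other direction, from the stored bytes back to the code point, may make the distinction clearer. The following is a rough sketch in plain C; utf8_decode_first and UTF8_INVALID are hypothetical names, and the validation is deliberately minimal (overlong forms and surrogates are not rejected).

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFFFFFFu  /* hypothetical sentinel for malformed input */

/* Hypothetical helper: decodes the first UTF-8 sequence in s (len bytes)
   and returns its code point, or UTF8_INVALID if the sequence is malformed
   or truncated. */
uint32_t utf8_decode_first(const unsigned char *s, size_t len)
{
    uint32_t cp;
    size_t need;  /* number of continuation bytes expected */

    if (s == NULL || len == 0) {
        return UTF8_INVALID;
    }
    if (s[0] < 0x80) {
        return s[0];                           /* plain ASCII */
    } else if ((s[0] & 0xE0) == 0xC0) {
        cp = s[0] & 0x1F; need = 1;            /* 110xxxxx */
    } else if ((s[0] & 0xF0) == 0xE0) {
        cp = s[0] & 0x0F; need = 2;            /* 1110xxxx */
    } else if ((s[0] & 0xF8) == 0xF0) {
        cp = s[0] & 0x07; need = 3;            /* 11110xxx */
    } else {
        return UTF8_INVALID;                   /* stray continuation or invalid lead byte */
    }
    if (len < need + 1) {
        return UTF8_INVALID;                   /* input ends mid-sequence */
    }
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return UTF8_INVALID;               /* not a 10xxxxxx byte */
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return cp;
}
```

Feeding it the two bytes c4 91 gives 0x0111, the code point of đ. The bytes are how the character is stored; the code point is what the \u escape expects.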

There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encodings of the Unicode character set, the code point for a given character is the same in all three.
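As an illustration of a second encoding form, here is a minimal sketch of UTF-16 encoding in plain C. The name utf16_encode is hypothetical. UTF-32 needs no helper at all, since each code point is stored directly as one 32-bit unit (0x00000111 for đ), with the byte order depending on the chosen endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16 code units and
   returns how many units were written (1 or 2), or 0 for invalid input. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                               /* lone surrogate values are invalid */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: the unit equals the code point */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                          /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For đ this produces the single unit 0x0111, while a character outside the Basic Multilingual Plane such as U+1F600 becomes the surrogate pair d83d de00. Since unichar in Foundation is a 16-bit UTF-16 code unit and 50321 equals 0xC491, this also appears to explain why the second snippet in the question yields U+C491 again.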

Footnote

1: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.

Upvotes: 3
