
Should I use UTF-8, UTF-16, or UTF-32 for my multilingual CMS?

Besides the difference in how characters are stored, are there any special characters in any language that UTF-32 can display and UTF-8 cannot?

Upvotes: 2

Answers (4)

Venkateswara Rao

Reputation: 5392

1) UTF-8 is backward compatible with ASCII: plain English text encodes to exactly the same bytes in both. This is an advantage when your clients use only English characters.

2) UTF-8 saves network bandwidth when ASCII characters outnumber non-ASCII characters.

3) UTF-16 saves storage space when most of your text is in scripts such as Chinese, Japanese or Korean, whose characters take 3 bytes in UTF-8 but 2 bytes in UTF-16.

I suggest using UTF-8, based on #1 above.
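To make points 1 to 3 concrete, here is a minimal Python sketch. The helper name `bytes_per_char` and the sample characters are my own picks for illustration, and `utf-16-le` is used only so that the byte-order mark is not counted.

```python
def bytes_per_char(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) needed for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# Point 1: pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "Hello, CMS"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print("ASCII and UTF-8 bytes identical:", same_bytes)

# Points 2 and 3: cost per character in different parts of Unicode.
sample_chars = (
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, Latin small e with acute
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
)
costs = {ch: bytes_per_char(ch) for ch in sample_chars}
for ch, (u8, u16) in costs.items():
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

UTF-16 only comes out ahead for the CJK character here; for accented Latin letters and for emoji the two are tied.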

Upvotes: 0

user541686

Reputation: 210455

Is there any character one of them can't represent?

In theory: No.

All of those formats can represent all Unicode code points.
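A quick way to convince yourself is a round trip through each encoding. This is only a sketch in Python; `round_trips` is a hypothetical helper name and the sample string is arbitrary.

```python
ENCODINGS = ("utf-8", "utf-16", "utf-32")

def round_trips(text):
    """True if text survives encode + decode unchanged in every encoding."""
    return all(text.encode(enc).decode(enc) == text for enc in ENCODINGS)

# ASCII, accented Latin, CJK, and an emoji outside the Basic Multilingual Plane.
sample = "A \u00e9 \u4e2d \U0001F600"
print(round_trips(sample))
```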

In practice: Depends.

The Windows API is UTF-16 today, but it started out as UCS-2 (essentially UTF-16 limited to the Basic Multilingual Plane, with no surrogate pairs), and parts of it still don't always handle surrogates correctly. So you might want to use UTF-16 so that your program behaves as "normally" as possible compared to other programs, instead of manually truncating UTF-32 code points above U+FFFF.
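To show what "handling surrogates" means, here is a small Python sketch (not Windows code; `utf16_code_units` and `splits_cleanly` are names I made up). A code point above U+FFFF becomes two 16-bit units in UTF-16, and code that treats each unit as a character can cut the pair in half.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

def splits_cleanly(text, unit_count):
    """True if keeping only the first unit_count code units still decodes."""
    data = text.encode("utf-16-le")[: unit_count * 2]
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

emoji = "\U0001F600"  # one code point, outside the Basic Multilingual Plane
units = utf16_code_units(emoji)
print([hex(u) for u in units])    # a high surrogate followed by a low surrogate
print(splits_cleanly(emoji, 1))   # cutting after one unit leaves a lone surrogate
```

UCS-2-style code would count this emoji as two characters and might truncate between them, which is the failure mode to watch for.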

Anything else?

Yes: Use UTF-8!

It's endian-less, so it avoids byte-order issues, which are a pain in the rear.
Of course, if you're on Windows then you need to convert your strings to UTF-16 before passing them to the API.
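Here is a small Python sketch of the byte-order problem, with an arbitrary two-character sample. The same text has two different UTF-16 byte sequences, and guessing the wrong one silently produces different characters, whereas UTF-8 has only one byte sequence.

```python
import codecs

text = "\u00e9\u4e2d"  # U+00E9 and U+4E2D

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")
print(little.hex(), big.hex())  # same code units, bytes swapped

# Reading little-endian bytes as big-endian does not fail, it just gives other text.
misread = little.decode("utf-16-be")
print(misread == text)

# The plain "utf-16" codec prepends a byte-order mark so readers can tell which is which.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# UTF-8 is a sequence of single bytes, so there is nothing to swap.
utf8 = text.encode("utf-8")
print(utf8.hex())
```

For the Windows case, `text.encode("utf-16-le")` is the kind of conversion meant above, since Windows wide strings are little-endian UTF-16.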

Upvotes: 1

socha23

Reputation: 10239

UTF-8, UTF-16 and UTF-32 can all be used to represent every Unicode code point. So no, there are no special characters that can be represented in UTF-32 and not in UTF-8.

Upvotes: 0

Sean Owen

Reputation: 66886

All UTF encodings can represent the same range of code points (0 to 0x10FFFF). So, the same characters can be encoded by any of them.
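As a rough check of that range in Python (the helper names are mine): the top code point encodes in all three, anything past it is not a code point at all, and the one exception inside the range is the surrogate block U+D800 to U+DFFF, which is reserved for UTF-16 pairs and is rejected by all three encoders.

```python
MAX_CODE_POINT = 0x10FFFF
ENCODINGS = ("utf-8", "utf-16", "utf-32")

def in_unicode_range(code_point):
    """True if Python accepts the number as a code point."""
    try:
        chr(code_point)
        return True
    except ValueError:
        return False

def can_encode(ch, encoding):
    """True if the character can be encoded with the given codec."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

top = chr(MAX_CODE_POINT)
lone_surrogate = "\ud800"

print(in_unicode_range(MAX_CODE_POINT), in_unicode_range(MAX_CODE_POINT + 1))
print([can_encode(top, enc) for enc in ENCODINGS])
print([can_encode(lone_surrogate, enc) for enc in ENCODINGS])
```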

Whether they can be "displayed" is an entirely different question. That has nothing to do with the encoding; it is a function of the font family used. I am not sure that any font has glyphs for every single Unicode code point. But I assume you meant "represented".

They do vary in how many bytes they need to represent a given string. UTF-8 is almost always the shortest for non-Asian languages. For Asian languages, UTF-16 might win (I haven't really benchmarked it). I can't imagine a realistic case where UTF-32 would be optimal, since it always spends 4 bytes per code point.
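Not a real benchmark, but a small Python sketch along these lines lets you measure your own content. The sample strings and helper names are placeholders I picked, and the `-le` codec variants are used so that byte-order marks are not counted.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(text):
    """Map each encoding to the number of bytes it needs for text."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}

def smallest(text):
    """Name of the encoding that needs the fewest bytes for text."""
    sizes = encoded_sizes(text)
    return min(sizes, key=sizes.get)

samples = {
    "english": "Hello world",
    "chinese": "\u4f60\u597d\u4e16\u754c",       # four CJK characters
    "chinese_html": "<p>\u4f60\u597d</p>",       # two CJK characters inside markup
}

for name, text in samples.items():
    print(name, encoded_sizes(text), "->", smallest(text))
```

Note the last sample: for a CMS that serves HTML, the ASCII markup around the text can tip the balance back toward UTF-8 even for Asian-language content.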