Reputation: 43

When to use Unicode (aside with non-unicode!)

I haven't found much concise information about exactly when to use Unicode. I understand that many consider it best practice to always use Unicode, but Unicode strings DO have a larger memory footprint. Am I correct to say that Unicode must be used only when:

Printing something to the screen for anything other than local use (local use being, for example, debugging).
Generally, sending any type of text across a network when the two ends are in different locales/countries.
When you're not sure which to use

I think it would be beneficial if someone explained, concisely, the basics of what actually happens with Unicode... am I correct to say that things get messy when:

the physical (byte) string gets sent to a machine that uses a different representation of strings (code page or otherwise; this is already a detail, although an interesting one) from the sender's.
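To make that scenario concrete, here is a minimal Python sketch of a byte string being written with one encoding and read with another. The choice of UTF-8 for the sender and Latin-1 for the receiver is an assumption made purely for illustration.

```python
# The sender encodes the text as UTF-8; the receiver wrongly assumes Latin-1.
original = "\u00f6"                 # the single character o-with-diaeresis (U+00F6)
sent = original.encode("utf-8")     # two bytes on the wire: c3 b6

received = sent.decode("latin-1")   # each byte is misread as its own character
correct = sent.decode("utf-8")      # decoding with the sender's encoding round-trips

print(sent.hex())                   # c3b6
print(len(original), len(received)) # 1 2
print(received == original)         # False
print(correct == original)          # True
```

The bytes never changed in transit; only the receiver's interpretation of them did.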

The context is using Unicode in a programming language (say C++), but I hope answers to this question can be used for any encoding situation.
Also, I'm aware Unicode and NLS are not the same thing, but is it correct to say that NLS implies usage of Unicode?

P.S. awesome site

Upvotes: 3

Answers (3)

Sodved

Reputation: 8588

I have a little utility (http://sodved.awardspace.info/unicode.pl) which is sometimes helpful in seeing the difference between character encodings. If you paste ö into the Raw (UTF-8) field you will see that it is represented by different byte sequences in different encodings. And as the other two good answers describe, some non-Unicode encodings cannot represent it at all.
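A similar comparison can be sketched in a few lines of Python. This is not the utility linked above, just a rough stand-in, and the list of codecs is an arbitrary selection chosen for illustration.

```python
# Encode the same character with several codecs and compare the resulting bytes.
char = "\u00f6"  # o-with-diaeresis, U+00F6

encoded = {}
for codec in ("utf-8", "utf-16-le", "latin-1", "cp437", "ascii"):
    try:
        encoded[codec] = char.encode(codec)
    except UnicodeEncodeError:
        encoded[codec] = None  # this codec has no representation for the character

for codec, data in encoded.items():
    print(codec, data.hex() if data is not None else "cannot encode")
```

UTF-8, UTF-16, Latin-1 and code page 437 each produce a different byte sequence for the same character, while ASCII cannot represent it at all.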

Upvotes: 1

deceze

Reputation: 522081

The beauty of Unicode is that it frees you from restrictions and lots of headaches. Unicode is the largest character set available to date, i.e. it enables you to actually encode and use virtually any character of any halfway mainstream language in use today. With any other character set you need to think about whether it can actually encode a character or not. Latin-1 cannot encode the character "あ", Shift-JIS cannot encode the character "ڥ" and so on. Only if you're very sure you will never ever need anything other than basic Latin/Arabic/Japanese/whatever other subset of characters should you choose a specialized encoding such as Latin-1, BIG-5, Shift-JIS or ASCII.
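The two examples above are easy to check in Python. In this small sketch, `can_encode` is a hypothetical helper name introduced for illustration, and the codec names are the ones Python's standard library uses.

```python
def can_encode(text, codec):
    """Return True if every character in text has a representation in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

hiragana_a = "\u3042"   # the Japanese character from the example above
arabic_feh = "\u06a5"   # the Arabic-script character from the example above

print(can_encode(hiragana_a, "latin-1"))                # False
print(can_encode(arabic_feh, "shift_jis"))              # False
print(can_encode(hiragana_a, "shift_jis"))              # True
print(can_encode(hiragana_a + arabic_feh, "utf-8"))     # True
```

Only the Unicode encoding handles both characters in the same string.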

Unicode is the most versatile charset available and therefore a good standard to adhere to.

The Unicode encodings are nothing special; they're just a little more complex in their bit representation, since they have to encode many more characters while still trying to be space efficient. For a very detailed excursion into this topic, please see "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text".

Upvotes: 4

Joachim Sauer

Reputation: 308031

Always use Unicode, it will save you and others a lot of pain.

What you may have confused is the issue of encoding. Unicode strings do not necessarily take more memory than the equivalent ASCII (or other encoding) strings; that depends a lot on the encoding used.
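As a rough illustration of how much the encoding matters, the Python sketch below counts the bytes each Unicode encoding needs for two sample strings. `encoded_sizes` is a hypothetical helper, and it measures the encoded bytes only, not the in-memory size of a string object in any particular language runtime.

```python
def encoded_sizes(text):
    """Number of bytes needed to store text in each Unicode encoding."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(text.encode(codec)) for codec in codecs}

english = "hello"
japanese = "\u3042" * 5   # five Japanese hiragana characters

print(len(english.encode("ascii")))   # 5
print(encoded_sizes(english))         # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(japanese))        # {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}
```

For pure ASCII text, UTF-8 costs exactly as much as ASCII; for the Japanese sample, UTF-16 is the more compact of the two.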

Sometimes "Unicode" is used as a synonym for "UCS-2" or "UTF-16". Strictly speaking that use is wrong, because "Unicode" is the standard that defines the set of characters and their Unicode code points. It does not as such define a mapping to bytes (or words). UTF-16, UTF-8 and other encodings take over the job of mapping the characters to concrete bytes.
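The split between code point and encoding can be seen in a short Python sketch. The sample character is an arbitrary choice, and the big-endian codec variants are used only so that no byte order mark is added.

```python
char = "\u3042"                      # a Japanese hiragana character

# The code point is a plain number assigned by the Unicode standard.
code_point = ord(char)
print(hex(code_point))               # 0x3042

# Each encoding maps that same number to a different byte sequence.
utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-be")
utf32 = char.encode("utf-32-be")
print(utf8.hex())                    # e38182
print(utf16.hex())                   # 3042
print(utf32.hex())                   # 00003042
```

The code point stays the same throughout; only the bytes chosen to represent it differ.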

Upvotes: 5
