Oak

Reputation: 26868

Is there any reason to prefer UTF-16 over UTF-8?

Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.

However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information.

Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do that as well?

EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.

Upvotes: 34

Views: 8752

Answers (8)

plugwash

Reputation: 10484

Does anyone know why these languages chose UTF-16?

Short answer:

Because Sun and Microsoft were early adopters of Unicode.

Long answer:

Sometime in the late 1980s, people started to realise that a universal character set was desirable, but what bit width to use was a matter of controversy. In 1989 ISO proposed a draft of ISO 10646 that offered multiple encoding modes: one mode would use 32 bits per character and be able to encode everything without any mode switching; another would use 16 bits per character but have an escape system for switching; yet another would use a byte-based encoding that had a number of design flaws.

A number of major software vendors did not like the ISO 10646 draft, seeing it as too complicated. They backed an alternative scheme called Unicode. Unicode 1.0, a fixed-width 16-bit encoding, was published in October 1991. The software vendors were able to convince the national standards bodies to vote down the ISO 10646 draft, and ISO was pushed into unification with Unicode.

So that was where we were in the early 1990s: a number of major software vendors had collaborated to design a new fixed-width encoding and were adopting it in their flagship products, including Windows NT and Java.

Meanwhile, X/Open was looking for a better way to encode Unicode in extended-ASCII contexts. ISO had proposed an encoding known as "UTF-1", but it sucked for several reasons: it was slow to process because it required calculations modulo a number that was not a power of 2, it was not self-synchronizing, and shorter sequences could appear as sub-sequences of longer ones. These efforts resulted in what we now know as UTF-8, but they happened away from the main line of Unicode development. UTF-8 was developed in 1992 and presented in 1993 at a USENIX conference, but it does not seem to have been considered a proper standard until 1996.

This is the environment in which Windows NT (released July 1993) and Java (released January 1996) were designed and released. Unicode was a simple fixed-width encoding and hence the obvious choice for an internal processing format. Java did adopt a modified form of UTF-8, but only as a storage format.

There was pressure to encode more characters, and in July 1996 Unicode 2.0 was introduced. The code space was expanded to just over 20 bits, and Unicode was no longer a fixed-width 16-bit encoding. Instead there was a choice of a fixed-width 32-bit encoding or variable-width encodings with 8-bit and 16-bit units.

No one wanted to take the space hit of 32-bit code units or the compatibility hit of changing their code-unit size for the second time, so the systems that had been designed around the original 16-bit Unicode generally ended up using UTF-16. Sure, there was some risk that text processing could miscount or mangle the new characters, but meh, it was still the lesser evil.

The .NET Framework, and with it C#, was introduced somewhat later, in 2002, but by this time Microsoft was already deeply committed to 16-bit code units. Their operating system APIs used them. Their file systems used them. Their executable formats used them.

Unix and the Internet, on the other hand, stayed largely byte-based all the way through. UTF-8 was treated as just another encoding, gradually replacing the legacy encodings that came before it.

Upvotes: 1

Aaron Muir Hamilton

Reputation: 132

If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.

As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar because UTF-16 does not always encode codepoints as a single number anyway, and single codepoints are generally not the right way to segment text.

Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.

After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.
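To make the uncompressed comparison above concrete, here is a minimal Java sketch (the sample strings are made up purely for illustration) that encodes the same text as UTF-8 and as UTF-16 and compares byte counts; the ASCII-heavy markup in the second string tips the balance back toward UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) {
        // Hypothetical samples: plain Japanese text, and the same text wrapped in HTML markup.
        String plain = "これは日本語のサンプルテキストです。";
        String html  = "<p class=\"sample\">これは日本語のサンプルテキストです。</p>";

        for (String s : new String[] { plain, html }) {
            int utf8  = s.getBytes(StandardCharsets.UTF_8).length;     // 3 bytes per kana/kanji
            int utf16 = s.getBytes(StandardCharsets.UTF_16LE).length;  // 2 bytes each; LE avoids a BOM
            System.out.printf("UTF-8: %3d bytes, UTF-16: %3d bytes%n", utf8, utf16);
        }
    }
}
```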

Upvotes: 3

Andrew Russell

Reputation: 27195

I imagine C# using UTF-16 derives from the Windows NT family of operating systems using UTF-16 internally.

I imagine there are two main reasons why Windows NT uses UTF-16 internally:

  • For memory usage: UTF-32 wastes a lot of space encoding most text.
  • For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16, a character is either a Basic Multilingual Plane character (2 bytes) or a surrogate pair (4 bytes), whereas a UTF-8 character can be anywhere between 1 and 4 bytes.

Contrary to what other people have answered - you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over actual characters in a string, you have to use unicode-friendly iteration functions. For example in C# you need to use StringInfo.GetTextElementEnumerator().
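The Java side of the same point looks roughly like the sketch below (the sample string is made up for illustration): a naive char loop sees the two surrogate halves of a supplementary character separately, while codePoints() yields whole code points and BreakIterator yields user-perceived characters, roughly what C#'s text elements are.

```java
import java.text.BreakIterator;

public class IterationDemo {
    public static void main(String[] args) {
        // "a", U+1F600 (a surrogate pair), then "e" followed by a combining acute accent.
        String s = "a\uD83D\uDE00e\u0301";

        // Naive char loop: reports the two surrogate halves of U+1F600 separately.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("char[%d] = U+%04X%n", i, (int) s.charAt(i));
        }

        // Code-point iteration: one value per Unicode code point.
        s.codePoints().forEach(cp -> System.out.printf("code point U+%04X%n", cp));

        // Grapheme iteration: one item per user-perceived character, so the
        // "e" and its combining accent come out together.
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        for (int start = it.first(), end = it.next(); end != BreakIterator.DONE;
                 start = end, end = it.next()) {
            System.out.println("grapheme: " + s.substring(start, end));
        }
    }
}
```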

For further information, this Wikipedia page is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Upvotes: 8

NoozNooz42

Reputation: 4328

@Oak: this is too long for a comment...

I don't know about C# (and would be really surprised: it would mean they just copied Java too much) but for Java it's simple: Java was conceived before Unicode 3.1 came out.

Hence there were fewer than 65537 code points, so every Unicode code point still fit in 16 bits, and thus the Java char was born.

Of course this led to crazy issues that are still affecting Java programmers (like me) today, where you have a method charAt which in some cases returns neither a Unicode character nor a Unicode code point, and a method codePointAt (added in Java 5) which takes an argument that is not the number of code points you want to skip! (You have to supply codePointAt with the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.)
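A minimal sketch of that confusion, using a made-up string that contains one supplementary character (U+1D11E, which needs a surrogate pair in UTF-16):

```java
public class CodePointIndexing {
    public static void main(String[] args) {
        // "G-clef: " (8 chars) followed by U+1D11E MUSICAL SYMBOL G CLEF, a surrogate pair.
        String s = "G-clef: \uD834\uDD1E";

        System.out.println(s.length());                  // 10 char units, but only 9 code points
        System.out.printf("%X%n", (int) s.charAt(8));    // D834: a lone high surrogate, not a character
        System.out.printf("%X%n", s.codePointAt(8));     // 1D11E: the real code point

        // codePointAt takes a char index, not a code-point index; to skip N code
        // points you have to translate the index with offsetByCodePoints first.
        int idx = s.offsetByCodePoints(0, 8);            // char index of the 9th code point: 8
        System.out.printf("%X%n", s.codePointAt(idx));   // 1D11E again
    }
}
```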

So, yup, this is definitely wild and confuses most Java programmers (most aren't even aware of these issues) and, yup, it is for historical reasons. At least, that was the excuse that came up when people got mad about this issue: it's because Unicode 3.1 wasn't out yet.

:)

Upvotes: 10

richj

Reputation: 7529

UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean, where most characters can be represented in one 16-bit word. Some rarely used characters may require two 16-bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, typically requiring three bytes for characters that UTF-16 can represent in two.

UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer), and some of the disadvantages of UTF-16 (e.g. confusing code that relies on \0 terminators) are less relevant.

Upvotes: 3

corvuscorax

Reputation: 5930

It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.

Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. That approach would break down completely with UTF-8, where code points are encoded as a variable number of bytes.

Upvotes: 3

Dean Harding

Reputation: 72638

East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).

Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.

Processing of UTF-16 for user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way that combining characters behave. So UTF-16 can usually be processed as a fixed-size encoding.
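A small Java sketch of what "processing UTF-16 as if it were fixed-size" can look like in practice (the string is a made-up example): as long as indices come from searching rather than arbitrary arithmetic, a surrogate pair rides along untouched, much as a combining character would.

```java
public class PassThrough {
    public static void main(String[] args) {
        // A made-up string containing U+1F600, which is a surrogate pair in UTF-16.
        String s = "price \uD83D\uDE00 = 10";

        // Search and split work on 16-bit code units; the surrogate pair is
        // matched and copied as a unit without ever being decoded.
        int eq = s.indexOf('=');
        String left  = s.substring(0, eq).trim();   // "price " plus the emoji, pair intact
        String right = s.substring(eq + 1).trim();  // "10"

        System.out.println(left + " -> " + right);
    }
}
```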

Upvotes: 39

to StackOverflow

Reputation: 124686

For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so you can treat UTF-16 as a fixed-length encoding.

So you avoid all the complexity of variable-length encodings like UTF-8.

Upvotes: 1
