Dan Brenner

Reputation: 898

Confusion over UTF-16 and UTF-32

From what I understand, the main difference between UTF-16 and UTF-32 is that UTF-32 is always four bytes per character, while UTF-16 is sometimes one byte and sometimes two bytes per character. This gives UTF-16 the advantage of taking up less memory than UTF-32, but UTF-32 has the advantage of constant-time access to the nth character.

My question is: if you can represent every Unicode character with at most two bytes, as done in UTF-16, then why isn't there a format that always uses two bytes to encode each character? This format, while slightly more memory-expensive than UTF-16, would be strictly better than UTF-32, allowing constant-time access while using half the memory.

What is my misunderstanding here?

Upvotes: 0

Views: 280

Answers (3)

Kerrek SB

Reputation: 477474

You got it a bit wrong:

  1. Unicode defines values (code points) up to, but not including, 0x110000, i.e. fewer than 2^21 of them. Once 0x10FFFF has been reached, new encoding schemes will be needed, but there are tons of unused code points, so Unicode has plenty of room to expand for the foreseeable future before hitting that limit.

  2. UTF-32 uses 32-bit code units. Since no code point exceeds 0x10FFFF, every code point fits in a single code unit.

  3. UTF-16 uses 16-bit code units. Its encoding scheme uses 1 code unit for code points below 0x10000, and two code units (known as a surrogate pair) for the remaining code points. UTF-16 is designed to encode code points up to 0x10FFFF.

  4. UTF-8 uses 8-bit code units. Its encoding scheme uses between 1 and 4 code units to represent a code point, depending on its value. The original scheme allowed up to 6 code units for code points up to 0x7FFFFFFF, but it was later restricted to 4 code units, so that code points above 0x10FFFF, which are not representable in UTF-16, are also illegal in UTF-8; this keeps conversion between UTF-8 and UTF-16 lossless. (A small sketch of these code-unit counts follows this list.)
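
To make the code-unit counts above concrete, here is a small Python sketch (my illustration, not part of the original answer; the helper names are made up, and Python's standard codecs are used only to count bytes):

    # Count code units per code point by encoding with Python's standard
    # codecs and dividing the byte length by the code-unit size.

    def utf16_code_units(cp: int) -> int:
        """One unit below 0x10000, two units (a surrogate pair) up to 0x10FFFF."""
        return len(chr(cp).encode("utf-16-le")) // 2   # 2 bytes per code unit

    def utf8_code_units(cp: int) -> int:
        """Between 1 and 4 units, depending on the code point's value."""
        return len(chr(cp).encode("utf-8"))            # 1 byte per code unit

    for cp in (0x41, 0x20AC, 0xFFFD, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: UTF-16 units = {utf16_code_units(cp)}, "
              f"UTF-8 units = {utf8_code_units(cp)}")

(Code points in the surrogate range U+D800–U+DFFF are left out, since they cannot be encoded on their own.)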

Upvotes: 6

dan04

Reputation: 91209

why isn't there a format that always uses two bytes to encode each character?

There is; it's called UCS-2.

The problem is, a straight 16-bit format only lets you represent 2^16 = 65,536 code points. This was enough for Unicode 1.0 (whose goal was “to encompass the characters of all the world's living languages”), but then the scope of the project expanded to include historical scripts such as Egyptian hieroglyphs, and 16 bits proved too restrictive.

So, the Unicode Consortium decided to add 16 supplementary planes with room for a million new characters, expanding the upper limit of the code space from U+FFFF to U+10FFFF. Simultaneously, the “surrogate pair” mechanism of UTF-16 was invented so that platforms which had already been built around UCS-2 (notably, Windows NT and the Java programming language) could represent the additional code points.
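
The surrogate-pair arithmetic is compact enough to sketch; the following Python snippet is my own illustration of the published UTF-16 scheme (the function name is made up):

    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Split a supplementary code point (U+10000..U+10FFFF) into a
        high and a low surrogate code unit, per the UTF-16 scheme."""
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                 # 20-bit offset into the supplementary planes
        high = 0xD800 + (v >> 10)        # top 10 bits
        low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
        return high, low

    # U+1D11E (MUSICAL SYMBOL G CLEF) encodes as the pair 0xD834, 0xDD1E
    print([hex(u) for u in to_surrogate_pair(0x1D11E)])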

Upvotes: 1

Jukka K. Korpela

Reputation: 201818

UTF-16 uses two bytes for characters in Plane 0, the Basic Multilingual Plane (BMP), U+0000...U+FFFF, and four bytes for any other character. You cannot represent all Unicode characters in two bytes.
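
As a quick sanity check (my own snippet, not part of the answer), encoding one BMP character and one non-BMP character in Python shows the two-byte versus four-byte difference; 'utf-16-le' is used so that no BOM bytes are counted:

    for ch in ("A", "\u20AC", "\U0001F600"):   # U+0041 and U+20AC are in the BMP, U+1F600 is not
        print(f"U+{ord(ch):04X}: {len(ch.encode('utf-16-le'))} bytes in UTF-16")

This prints 2 bytes for U+0041 and U+20AC, and 4 bytes for U+1F600.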

Upvotes: 1
