Gwater17

Reputation: 2314

How to interpret Unicode encodings

I just finished reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.

I'd really appreciate clarification on this part of the article.

OK, so say we have a string: Hello which, in Unicode, corresponds to these five code points: U+0048 U+0065 U+006C U+006C U+006F...That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

My questions:

Why could the two zeros at the beginning of 0048 be moved to the end?

What are FE FF and FF FE, what's the difference between them, and how were they used? (Yes, I tried googling those terms, but I'm still confused.)

Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?

Also, I'd appreciate any recommended resources to learn more about this stuff.

Upvotes: 0

Views: 147

Answers (2)

weibeld

Reputation: 15232

Summary: the byte-order mark character 0xFEFF is used to solve the endianness problem for some character encodings. However, most of today's character encodings are not prone to the endianness problem, so the byte-order mark is not really relevant today.


Why could the two zeros at the beginning of 0048 be moved to the end?

If two bytes are used for all characters, then each character is saved in a 2-byte data structure in the memory of the computer. Bytes (groups of 8 bits) are the basic addressable units in most computer memories, and each byte has its own address. On systems that use the big-endian format, the character 0x0048 would be saved in two 1-byte memory cells in the following way:

 n    n+1
+----+----+
| 00 | 48 |
+----+----+

Here, n and n+1 are the addresses of the memory cells. Thus, on big-endian systems, the most significant byte is stored at the lowest memory address of the data structure.

On a little-endian system, on the other hand, the character 0x0048 would be stored in the following way:

 n    n+1
+----+----+
| 48 | 00 |
+----+----+

Thus, on little-endian systems, the least significant byte is stored at the lowest memory address of the data structure.
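As a quick sketch of these two layouts (in Python, purely for illustration; int.to_bytes lets you pick the byte order explicitly):

value = 0x0048
print(value.to_bytes(2, "big").hex())     # "0048" -> bytes 00 48 (most significant byte first)
print(value.to_bytes(2, "little").hex())  # "4800" -> bytes 48 00 (least significant byte first)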

So, if a big-endian system sends you the character 0x0048 (for example, over the network), it sends you the byte sequence 00 48. On the other hand, if a little-endian system sends you the character 0x0048, it sends you the byte sequence 48 00.
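Since UTF-16 stores these particular characters in exactly two bytes each, Python's standard codecs (used here only as an illustration) reproduce exactly the two byte sequences from the article:

print("Hello".encode("utf-16-be").hex())  # 00480065006c006c006f  -> 00 48 00 65 00 6C 00 6C 00 6F
print("Hello".encode("utf-16-le").hex())  # 480065006c006c006f00  -> 48 00 65 00 6C 00 6C 00 6F 00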

So, if you receive a byte sequence like 00 48, and you know that it represents a 16-bit character, you need to know whether the sender was a big-endian or a little-endian system. In the first case, 00 48 means the character 0x0048; in the second case, it means the totally different character 0x4800.
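You can see the ambiguity directly by decoding the same two bytes under both assumptions (again a small Python sketch, just for illustration):

data = bytes([0x00, 0x48])
print(data.decode("utf-16-be"))  # 'H'      (U+0048)
print(data.decode("utf-16-le"))  # '\u4800' (a completely different CJK character)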

This is where the FE FF sequence comes in.

What are FE FF and FF FE, what's the difference between them, and how were they used?

U+FEFF is the Unicode byte-order mark (BOM), and in our example of a 2-byte encoding, this would be the 16-bit character 0xFEFF.

The convention is that all systems (big-endian and little-endian) save the character 0xFEFF as the first character of any text stream. Now, on a big-endian system, this character is represented as the byte sequence FE FF (assuming memory addresses increase from left to right), whereas on a little-endian system, it is represented as FF FE.

Now, if you read a text stream that has been created by following this convention, you know that the first character must be 0xFEFF. So, if the first two bytes of the text stream are FE FF, you know that the stream was created by a big-endian system. On the other hand, if the first two bytes are FF FE, you know that the stream was created by a little-endian system. In either case, you can now correctly interpret all the 2-byte characters of the stream.
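Here is a small sketch of this convention in Python (the BOM constants and the generic "utf-16" codec are part of the standard library; the stream names are made up for the example):

import codecs

be_stream = codecs.BOM_UTF16_BE + "Hello".encode("utf-16-be")  # fe ff 00 48 00 65 ...
le_stream = codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")  # ff fe 48 00 65 00 ...

# The generic "utf-16" codec reads the BOM, picks the right byte order,
# and strips the BOM from the decoded text.
print(be_stream.decode("utf-16"))  # Hello
print(le_stream.decode("utf-16"))  # Hello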

Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?

Placing the byte-order mark (BOM) character 0xFEFF at the beginning of each text stream is just a convention, and not all systems follow it. So, if the BOM is missing, you have the problem of not knowing whether to interpret the 2-byte characters as big-endian or little-endian.

Also, I'd appreciate any recommended resources to learn more about this stuff.


Notes:

Today, the most widely used Unicode-compatible encoding is UTF-8. UTF-8 has been designed to avoid the endianness problem, so the entire byte-order mark 0xFEFF story is not relevant for UTF-8.
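For instance (Python, for illustration), the UTF-8 byte sequence of a character is the same on every machine, because the format itself fixes the order of the bytes:

print("Hello".encode("utf-8").hex())  # 48656c6c6f  (one byte per ASCII character)
print("é".encode("utf-8").hex())      # c3a9        (two bytes, but their order is fixed by UTF-8 itself)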

The byte-order mark is, however, relevant to the other Unicode-compatible encodings, UTF-16 and UTF-32, which are prone to the endianness problem. If you browse through the list of available encodings, for example in the settings of a text editor or terminal, you will see that there are big-endian and little-endian versions of UTF-16 and UTF-32, typically called UTF-16BE and UTF-16LE, or UTF-32BE and UTF-32LE, respectively. However, UTF-16 and UTF-32 are rarely used for data exchange in practice.
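In Python, for example, the explicit BE/LE variants fix the byte order and write no BOM, while the plain "utf-16" codec writes a BOM in the machine's native byte order (the third output comment below assumes a little-endian machine):

print("H".encode("utf-16-be").hex())  # 0048
print("H".encode("utf-16-le").hex())  # 4800
print("H".encode("utf-16").hex())     # fffe4800  (BOM ff fe, then 48 00)
print("H".encode("utf-32-be").hex())  # 00000048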

Other popular encodings used today include those from the ISO 8859 series, such as ISO 8859-1, known as Latin-1 (and the derived Windows-1252), as well as plain ASCII. However, all of these are single-byte encodings, that is, each character is encoded as 1 byte and saved in a 1-byte data structure. Thus, the endianness problem doesn't apply, and the byte-order mark story is not relevant for these cases either.
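A quick illustration (Python again): one byte per character, so there is simply nothing to reorder:

print("Héllo".encode("latin-1").hex())  # 48e96c6c6f  (é is the single byte e9)
print("Hello".encode("ascii").hex())    # 48656c6c6f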

All in all, the endianness problem for character encodings that you struggled to understand is thus mostly of historical value and is not really relevant anymore in today's world.

Upvotes: 2

squaregoldfish

Reputation: 800

This is all to do with how data is stored internally in the computer's memory. In this example (00 48), some computers will store the most significant byte first and the least significant byte second (known as big-endian), and others will store the least significant byte first (little-endian). So, depending on your computer, when you read the bytes out of memory you'll get either the 00 first or the 48 first, and you need to know which way round it's going to be to make sure you interpret the bytes correctly. For a more in-depth introduction to the topic, see Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
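If you want to see this on your own machine, here is a small sketch in Python (just for illustration; sys.byteorder reports the native byte order, and struct lets you pick one explicitly):

import struct
import sys

print(sys.byteorder)              # e.g. "little" on typical x86/ARM machines
print(struct.pack(">H", 0x0048))  # b'\x00H'  -> bytes 00 48 (big-endian)
print(struct.pack("<H", 0x0048))  # b'H\x00'  -> bytes 48 00 (little-endian)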

These days, most compilers and interpreters will take care of this low-level stuff for you, so you will rarely (if ever) need to worry about it.

Upvotes: 1
