Reputation: 16147
So I'm teaching myself character encoding, and I have a presumably stupid question: Wikipedia says
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), ...
and a chart on that page reads:

    Encoding      Representation (hexadecimal)
    UTF-8         EF BB BF
    UTF-16 (BE)   FE FF
    UTF-16 (LE)   FF FE
    ...
I'm a little confused by it. As far as I know, most machines using Intel CPUs are little-endian, so why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?
Upvotes: 0
Views: 2007
Reputation: 598299
As far as I know, most machines using Intel CPUs are little-endian
Intel CPUs are not the only CPUs used in the world; there are also AMD, ARM, and others, and there are big-endian CPUs.
why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?
U+FEFF is the Unicode codepoint designation. FE FF, EF BB BF, and FF FE are sequences of bytes instead. The U+ prefix only applies to Unicode codepoint designations, not to bytes.
The numeric value of Unicode codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE (which is its official designation, not U+FEFF BYTE ORDER MARK, though it is also used as a BOM) is 0xFEFF (65279).
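You can check that numeric value yourself; for example, a quick Python sketch:

    # The codepoint designation U+FEFF is just the number 0xFEFF (65279):
    print(ord("\ufeff"))       # 65279
    print(hex(ord("\ufeff")))  # 0xfeff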
That codepoint value encoded in UTF-8 produces three 8-bit codeunit values 0xEF 0xBB 0xBF. Those bytes are not subject to any endianness issues, which is why UTF-8 does not have separate LE and BE variants.
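For example, a minimal Python sketch (bytes.hex() with a separator needs Python 3.8+):

    # U+FEFF encoded in UTF-8: three bytes, the same on every machine.
    print("\ufeff".encode("utf-8").hex(" "))  # ef bb bf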
That same codepoint value encoded in UTF-16 produces one 16-bit codeunit value 0xFEFF. Because it is a multi-byte (16-bit) value, it is subject to endianness when interpreted as two 8-bit bytes, hence the LE (0xFF 0xFE) and BE (0xFE 0xFF) variants.
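The same kind of sketch shows both byte orders:

    # One 16-bit code unit, split into bytes in two different orders:
    print("\ufeff".encode("utf-16-le").hex(" "))  # ff fe
    print("\ufeff".encode("utf-16-be").hex(" "))  # fe ff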
It is not just the BOM that is affected. All codeunits in a UTF-16 string are affected by endianness. The BOM helps a decoder know the byte order used for the codeunits in the entire string.
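Python's plain "utf-16" codec illustrates this: when the input starts with a BOM, it picks the byte order from it and strips the BOM from the result:

    data_le = b"\xff\xfe\x41\x00"    # BOM + "A" in UTF-16LE
    data_be = b"\xfe\xff\x00\x41"    # BOM + "A" in UTF-16BE
    print(data_le.decode("utf-16"))  # A
    print(data_be.decode("utf-16"))  # A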
UTF-32, which also uses multi-byte (32-bit) codeunits, is also subject to endianness, and thus it also has LE and BE variants, and a 32-bit BOM to express that byte order to decoders (0xFF 0xFE 0x00 0x00 for LE, 0x00 0x00 0xFE 0xFF for BE). And yes, as you can probably guess, there is an ambiguity between the UTF-16LE BOM and the UTF-32LE BOM if you don't know ahead of time which UTF you are dealing with. A BOM is meant to identify the byte order, hence the name "Byte Order Mark", not the particular encoding (though it is commonly used for that purpose).
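That ambiguity is easy to see in the same kind of sketch:

    # The UTF-32 BOMs:
    print("\ufeff".encode("utf-32-le").hex(" "))  # ff fe 00 00
    print("\ufeff".encode("utf-32-be").hex(" "))  # 00 00 fe ff
    # ff fe 00 00 is both a UTF-32LE BOM and a UTF-16LE BOM followed by U+0000:
    ambiguous = bytes.fromhex("ff fe 00 00")
    print(repr(ambiguous.decode("utf-32")))  # ''      (just a BOM, stripped)
    print(repr(ambiguous.decode("utf-16")))  # '\x00'  (BOM stripped, then NUL)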
Upvotes: 3
Reputation: 536755
why is the BOM U+FE FF for UTF-16 (BE)
It isn't. The BOM is character number U+FEFF. There's no space; it's a single hexadecimal number, aka 65279. This definition does not depend on what sequence of bytes is used to represent that character in any particular encoding.
It happens that the hexadecimal representation of the byte sequence that encodes the character(*) in UTF-16BE, 0xFE 0xFF, has the same order of digits as the hexadecimal representation of the character number U+FEFF; this is just an artefact of big-endianness, which puts the most significant content on the left, just as humans do when writing big [hexa]decimal numbers.
(* and indeed any character in the Basic Multilingual Plane. It gets hairier when you go above this range as they no longer fit in two bytes.)
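A quick Python sketch of both points, the big-endian artefact and the footnote (U+1F600, an emoji outside the BMP, is just an arbitrary example):

    # Big-endian puts the most significant byte first, so the bytes
    # read like the codepoint's hex digits; little-endian reverses them:
    print("\ufeff".encode("utf-16-be").hex(" "))      # fe ff
    print("\ufeff".encode("utf-16-le").hex(" "))      # ff fe
    # Above the BMP, a character takes two 16-bit code units
    # (a surrogate pair), i.e. four bytes:
    print("\U0001f600".encode("utf-16-be").hex(" "))  # d8 3d de 00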
Upvotes: 2