Reputation: 16147
So I'm teaching myself character encoding, and I have a presumably stupid question: Wikipedia says
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), ...
and a chart on that page reads:

    Encoding      Representation (hexadecimal)
    UTF-8         EF BB BF
    UTF-16 (BE)   FE FF
    UTF-16 (LE)   FF FE
    ...
I'm a little confused by it. As far as I know, most machines using Intel CPUs are little-endian, so why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?
Upvotes: 0
Views: 2007
Reputation: 598299
As far as I know, most machines using Intel CPUs are little-endian
Intel CPUs are not the only CPUs used in the world; there are also AMD, ARM, and others, and there are big-endian CPUs.
why is the BOM U+FE FF for UTF-16 (BE), rather than U+EF BB BF for UTF-8 or U+FF FE for UTF-16 (LE)?
U+FEFF is the Unicode codepoint designation. FE FF, EF BB BF, and FF FE are sequences of bytes instead. The U+ prefix only applies to Unicode codepoint designations, not to bytes.
The numeric value of Unicode codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE (which is its official designation, not U+FEFF BYTE ORDER MARK, though it is also used as a BOM) is 0xFEFF (65279).
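You can check that numeric value yourself; for example, a quick Python sketch:

    # The codepoint designation U+FEFF is just the number 0xFEFF (65279):
    print(ord("\ufeff"))       # 65279
    print(hex(ord("\ufeff")))  # 0xfeff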
That codepoint value encoded in UTF-8 produces three 8-bit codeunit values 0xEF 0xBB 0xBF. Those bytes are not subject to any endianness issues, which is why UTF-8 does not have separate LE and BE variants.
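For example, a minimal Python sketch (bytes.hex() with a separator needs Python 3.8+):

    # U+FEFF encoded in UTF-8: three bytes, the same on every machine.
    print("\ufeff".encode("utf-8").hex(" "))  # ef bb bf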
That same codepoint value encoded in UTF-16 produces one 16-bit codeunit value 0xFEFF. Because it is a multi-byte (16-bit) value, it is subject to endianness when interpreted as two 8-bit bytes, hence the LE (0xFF 0xFE) and BE (0xFE 0xFF) variants.
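The same kind of sketch shows both byte orders:

    # One 16-bit code unit, split into bytes in two different orders:
    print("\ufeff".encode("utf-16-le").hex(" "))  # ff fe
    print("\ufeff".encode("utf-16-be").hex(" "))  # fe ff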
It is not just the BOM that is affected. All codeunits in a UTF-16 string are affected by endianness. The BOM helps a decoder know the byte order used for the codeunits in the entire string.
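Python's plain "utf-16" codec illustrates this: when the input starts with a BOM, it picks the byte order from it and strips the BOM from the result:

    data_le = b"\xff\xfe\x41\x00"    # BOM + "A" in UTF-16LE
    data_be = b"\xfe\xff\x00\x41"    # BOM + "A" in UTF-16BE
    print(data_le.decode("utf-16"))  # A
    print(data_be.decode("utf-16"))  # A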
UTF-32, which also uses multi-byte (32-bit) codeunits, is also subject to endianness, and thus it also has LE and BE variants, and a 32-bit BOM to express that byte order to decoders (0xFF 0xFE 0x00 0x00 for LE, 0x00 0x00 0xFE 0xFF for BE). And yes, as you can probably guess, there is an ambiguity between the UTF-16LE BOM and the UTF-32LE BOM if you don't know ahead of time which UTF you are dealing with. A BOM is meant to identify the byte order, hence the name "Byte Order Mark", not the particular encoding (though it is commonly used for that purpose).
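That ambiguity is easy to see in the same kind of sketch:

    # The UTF-32 BOMs:
    print("\ufeff".encode("utf-32-le").hex(" "))  # ff fe 00 00
    print("\ufeff".encode("utf-32-be").hex(" "))  # 00 00 fe ff
    # ff fe 00 00 is both a UTF-32LE BOM and a UTF-16LE BOM followed by U+0000:
    ambiguous = bytes.fromhex("ff fe 00 00")
    print(repr(ambiguous.decode("utf-32")))  # ''      (just a BOM, stripped)
    print(repr(ambiguous.decode("utf-16")))  # '\x00'  (BOM stripped, then NUL)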
Upvotes: 3
Reputation: 536755
why is the BOM U+FE FF for UTF-16 (BE)
It isn't. The BOM is character number U+FEFF. There's no space; it's a single hexadecimal number, aka 65279. This definition does not depend on what sequence of bytes is used to represent that character in any particular encoding.
It happens that the hexadecimal representation of the byte sequence that encodes the character(*) in UTF-16BE, 0xFE 0xFF, has the same order of digits as the hexadecimal representation of the character number U+FEFF; this is just an artefact of big-endianness, which puts the most significant content on the left, just as humans do when writing big [hexa]decimal numbers.
(* and indeed any character in the Basic Multilingual Plane. It gets hairier when you go above this range as they no longer fit in two bytes.)
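A quick Python sketch of both points, the big-endian artefact and the footnote (U+1F600, an emoji outside the BMP, is just an arbitrary example):

    # Big-endian puts the most significant byte first, so the bytes
    # read like the codepoint's hex digits; little-endian reverses them:
    print("\ufeff".encode("utf-16-be").hex(" "))      # fe ff
    print("\ufeff".encode("utf-16-le").hex(" "))      # ff fe
    # Above the BMP, a character takes two 16-bit code units
    # (a surrogate pair), i.e. four bytes:
    print("\U0001f600".encode("utf-16-be").hex(" "))  # d8 3d de 00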
Upvotes: 2