Reputation: 649
The Unicode FAQ mentions that UTF-8 doesn't need a BOM:
Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian?
A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order.
For code points above U+0744, UTF-8 needs 2 to 4 bytes to represent them. Doesn't it need a BOM to specify the endianness of these bytes, or does UTF-8 adopt a default?
Upvotes: 5
Views: 1236
Reputation: 595961
UTF-8 uses 1-byte code units, so there is no need for a BOM to indicate a byte order: only one byte order is possible, and the encoding algorithm determines the ordering of the bytes. For example, U+0744 is encoded in UTF-8 as code units 0xDD 0x84, which are represented in bytes as DD 84. Bytes 84 DD would be an illegal UTF-8 sequence.
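A quick way to check this yourself is a minimal sketch using Python's built-in codecs (the variable names are just for illustration). It encodes U+0744 and then tries to decode the swapped bytes:

```python
ch = "\u0744"

# Encode U+0744 with the built-in UTF-8 codec.
encoded = ch.encode("utf-8")
print(encoded.hex())  # dd84

# Swapping the two bytes does not give "the other endianness";
# it gives a sequence that starts with a continuation byte (0x84),
# which a strict UTF-8 decoder rejects.
try:
    bytes([0x84, 0xDD]).decode("utf-8")
    swapped_is_valid = True
except UnicodeDecodeError:
    swapped_is_valid = False

print(swapped_is_valid)  # False
```

The result is the same on little-endian and big-endian machines, since the codec only ever deals with individual bytes.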
UTF-16 and UTF-32, by contrast, use 2-byte and 4-byte code units, respectively. The encoding algorithm determines the order of the code units, but since the code units themselves are multi-byte, they are subject to endianness. For example, U+0744 is encoded in UTF-16 as code unit 0x0744, and in UTF-32 as code unit 0x00000744, which are represented in bytes as 07 44 (big-endian) or 44 07 (little-endian) in UTF-16, and as 00 00 07 44 (big-endian) or 44 07 00 00 (little-endian) in UTF-32.
So, a BOM makes sense to indicate which endianness is actually being used for UTF-16/32, but not for UTF-8.
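To see the difference in practice, here is a small sketch using Python's standard codecs. It shows the two byte layouts for U+0744 in UTF-16 and UTF-32, what happens when the wrong byte order is assumed, and how a BOM lets the generic "utf-16" decoder pick the right one:

```python
import codecs

ch = "\u0744"

# Explicit byte orders: same code unit, different byte layout.
utf16_be = ch.encode("utf-16-be")  # 07 44
utf16_le = ch.encode("utf-16-le")  # 44 07
utf32_be = ch.encode("utf-32-be")  # 00 00 07 44
utf32_le = ch.encode("utf-32-le")  # 44 07 00 00

# Guessing the wrong byte order silently yields a different character:
# bytes 07 44 read as little-endian are code unit 0x4407.
misread = utf16_be.decode("utf-16-le")

# With a BOM in front, the generic "utf-16" decoder detects the order
# and strips the BOM from the result.
from_be = (codecs.BOM_UTF16_BE + utf16_be).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + utf16_le).decode("utf-16")

print(utf16_be.hex(), utf16_le.hex(), utf32_be.hex(), utf32_le.hex())
print(hex(ord(misread)), from_be == ch, from_le == ch)
```

Note that the misread case does not raise an error; it just produces the wrong text, which is why the signature matters for these encodings.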
Upvotes: 4
Reputation: 111249
UTF-8 gives a strict definition for the order of the bytes that encode a character. No variation between computing platforms is allowed.
For example, the Euro sign U+20AC must be encoded as the byte sequence \xE2\x82\xAC. No other ordering of these bytes is permitted.
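If it helps to see why the order is fixed, here is a rough hand-rolled sketch of the UTF-8 bit layout. The function name utf8_encode is hypothetical and only for illustration (in real code you would just call str.encode). Each byte's position is dictated by which bits of the code point it carries, so there is nothing left for the platform to reorder:

```python
def utf8_encode(code_point: int) -> bytes:
    """Illustrative UTF-8 encoder (hypothetical helper, not a library API)."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if 0 <= code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


euro = utf8_encode(0x20AC)
print(euro.hex())  # e282ac
```

The lead byte always carries the highest bits and comes first, followed by the continuation bytes in descending bit order, and the output matches what the built-in codec produces for the same character.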
Upvotes: 6