Reputation: 649

Why doesn't UTF-8 encoding need a Byte Order Mark?

The Unicode FAQ says that UTF-8 doesn't need a BOM:

Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian?

A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order.

But for code points above U+007F (U+0744, for example), UTF-8 needs 2 to 4 bytes to represent each one. Doesn't it need a BOM to specify the endianness of these bytes, or does UTF-8 adopt a default?

Upvotes: 5

Answers (2)

Remy Lebeau

Reputation: 598309

UTF-8 uses 1-byte code units, so there is no need for a BOM to indicate a byte order, because there is only 1 byte order possible, and the encoding algorithm determines the ordering of the bytes. For example, U+0744 is encoded in UTF-8 as code units 0xDD 0x84, which are represented in bytes as DD 84. Bytes 84 DD would be an illegal UTF-8 sequence.
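To make that concrete, here is a minimal Python sketch that builds the two bytes for U+0744 by hand and compares them with the built-in encoder. The variable names (`lead`, `trail`, `manual`, `swapped_ok`) are just illustrative, and the bit layout shown only covers the two-byte range U+0080..U+07FF.

```python
cp = 0x0744  # the code point used in the example above

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx.
# The high 5 bits always go in the first byte, the low 6 bits in the second.
lead = 0b11000000 | (cp >> 6)
trail = 0b10000000 | (cp & 0b00111111)
manual = bytes([lead, trail])

builtin = chr(cp).encode("utf-8")
print(manual.hex(), builtin.hex())

# Swapping the bytes is not "the other endianness"; it is simply invalid,
# because 0x84 is a continuation byte and cannot start a sequence.
try:
    bytes([trail, lead]).decode("utf-8")
    swapped_ok = True
except UnicodeDecodeError:
    swapped_ok = False
print(swapped_ok)
```

The lead byte's bit pattern tells the decoder how many continuation bytes follow, so the order is part of the encoding itself and not a property of the machine.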

UTF-16 and UTF-32 are different: they use 2-byte and 4-byte code units, respectively. The encoding algorithm determines the order of the code units, but since the code units themselves are multi-byte, they are subject to endianness. For example, U+0744 is encoded in UTF-16 as code unit 0x0744, and in UTF-32 as code unit 0x00000744. In bytes, that is 07 44 (big endian) or 44 07 (little endian) in UTF-16, and 00 00 07 44 (big endian) or 44 07 00 00 (little endian) in UTF-32.
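As a quick check, the following Python sketch encodes U+0744 with the explicit big-endian and little-endian codecs from the standard library. The variable names are only for illustration.

```python
ch = "\u0744"

# Same code unit (0x0744), two possible byte layouts in UTF-16.
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")

# Same code unit (0x00000744), two possible byte layouts in UTF-32.
utf32_be = ch.encode("utf-32-be")
utf32_le = ch.encode("utf-32-le")

for name, data in [("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le),
                   ("UTF-32BE", utf32_be), ("UTF-32LE", utf32_le)]:
    print(name, data.hex())

# UTF-8 has no BE/LE variants at all; there is only one result.
utf8 = ch.encode("utf-8")
print("UTF-8", utf8.hex())
```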

So, a BOM makes sense to indicate which endianness is actually being used for UTF-16/32, but not for UTF-8.
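A small sketch of what that looks like in practice, using Python's `codecs` constants and the generic `utf-16` and `utf-8-sig` codecs. The BOM is the code point U+FEFF; in UTF-16 its byte order reveals the endianness, while in UTF-8 it always comes out as the same three bytes and so can only serve as a signature.

```python
import codecs

bom = "\ufeff"

# In UTF-16 the BOM's bytes differ by endianness, which is what makes it useful.
bom_be = bom.encode("utf-16-be")   # FE FF
bom_le = bom.encode("utf-16-le")   # FF FE

# The generic "utf-16" decoder reads the BOM to pick the byte order.
from_be = (codecs.BOM_UTF16_BE + b"\x07\x44").decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + b"\x44\x07").decode("utf-16")

# In UTF-8 the BOM is always EF BB BF; it carries no byte-order information.
bom_utf8 = bom.encode("utf-8")

# "utf-8-sig" strips that signature if it is present.
from_sig = (codecs.BOM_UTF8 + b"\xdd\x84").decode("utf-8-sig")

print(bom_be.hex(), bom_le.hex(), bom_utf8.hex())
print(from_be == from_le == from_sig == "\u0744")
```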

Upvotes: 4

Joni

Reputation: 111389

UTF-8 gives a strict definition for the order of the bytes that encode a character. No variation between computing platforms is allowed.

For example, the Euro sign U+20AC must be encoded as the byte sequence \xE2\x82\xAC. No other ordering of these bytes is permitted.
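To illustrate, this Python sketch tries every ordering of those three bytes. It is only a demonstration; the names `results` and `matches` are made up for the example. Most orderings are rejected outright as invalid UTF-8, and one happens to be a valid sequence for a completely different character, so none of them is an alternative encoding of the Euro sign.

```python
from itertools import permutations

euro = "\u20ac"
encoded = euro.encode("utf-8")  # the only valid encoding of U+20AC

# Map each ordering of the three bytes to what it decodes to (None if invalid).
results = {}
for perm in permutations(encoded):
    candidate = bytes(perm)
    try:
        results[candidate] = candidate.decode("utf-8")
    except UnicodeDecodeError:
        results[candidate] = None

for candidate, decoded in results.items():
    label = "invalid" if decoded is None else f"U+{ord(decoded):04X}"
    print(candidate.hex(), label)

# Orderings that decode back to the Euro sign.
matches = [b for b, s in results.items() if s == euro]
```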

Upvotes: 6
