user10332687

List of BOM characters

Is there a list of the possible BOM byte sequences in use? So far I have encountered:

\x00\x00\xfe\xff    UTF-32, big-endian
\xff\xfe\x00\x00    UTF-32, little-endian
\xfe\xff            UTF-16, big-endian
\xff\xfe            UTF-16, little-endian
\xef\xbb\xbf        UTF-8

Are there any additional ones that I'm missing?

Upvotes: 2

Views: 2978

Answers (1)

J Quinn

Reputation: 349

Short answer: no, you've covered them.

According to the Unicode spec, UTF-8, UTF-16, and UTF-32 are the three encoding forms. When it comes to byte serialization, the spec lists seven encoding schemes: UTF-8, then UTF-16, UTF-16LE, and UTF-16BE as separate schemes, and similarly UTF-32, UTF-32LE, and UTF-32BE.
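To make the list concrete, here is a minimal Python sketch of BOM sniffing. The `BOM_*` constants come from the standard `codecs` module; the `BOMS` table and the `sniff_bom` helper are names I made up for illustration. The order of the table matters, because the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Check the 4-byte BOMs before the 2-byte ones: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so testing
# UTF-16 first would misclassify UTF-32 LE input.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that this is a heuristic: the bytes FF FE 00 00 could also be a UTF-16 LE BOM followed by U+0000, so a real detector may want extra context before committing to UTF-32.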

It's important to know that if the character stream is explicitly labeled as one of the LE or BE forms, a leading U+FEFF is not a BOM: you must interpret it as the character U+FEFF Zero Width No-Break Space and keep it as part of the text. That is:

UTF-16BE  initial FE FF is treated as U+FEFF
UTF-16LE  initial FF FE is treated as U+FEFF
UTF-32BE  initial 00 00 FE FF is treated as U+FEFF
UTF-32LE  initial FF FE 00 00 is treated as U+FEFF
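Python's built-in codecs happen to follow this distinction, so a short sketch can show it. The variable names here are just for illustration; the same four bytes are decoded once with the unmarked `utf-16` codec and once with the explicit `utf-16-le` codec.

```python
# FF FE followed by the letter "a" serialized as UTF-16 little-endian.
data = b"\xff\xfea\x00"

# Unmarked scheme: the leading FF FE is consumed as a BOM and dropped.
as_utf16 = data.decode("utf-16")

# Explicit scheme: the leading FF FE decodes to the character U+FEFF
# and remains part of the resulting string.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16))    # 'a'
print(repr(as_utf16le))  # '\ufeffa'
```

So if you already know the stream is UTF-16LE (or any of the other explicit forms) and still want to drop a leading U+FEFF, you have to strip it yourself after decoding.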

See http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G2212 for more details.

Upvotes: 3
