Reputation:
Is there a list of possible BOM characters that are used? So far I have encountered:
\x00\x00\xfe\xff UTF-32, big-endian
\xff\xfe\x00\x00 UTF-32, little-endian
\xfe\xff UTF-16, big-endian
\xff\xfe UTF-16, little-endian
\xef\xbb\xbf UTF-8
Are there any additional ones that I'm missing?
Upvotes: 2
Views: 2978
Reputation: 349
Short answer: no, you've covered them.
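If the goal is to detect these five signatures in code, here is a minimal sketch in Python. The function name `sniff_bom` and the `BOMS` table are made up for illustration; the `codecs.BOM_*` constants are from the standard library. The one thing to watch is ordering: the UTF-32LE signature (FF FE 00 00) begins with the UTF-16LE signature (FF FE), so the longer ones have to be checked first.

```python
import codecs

# Hypothetical lookup table, longest signatures first: the UTF-32LE BOM
# (FF FE 00 00) starts with the UTF-16LE BOM (FF FE), so testing UTF-16
# before UTF-32 would misreport UTF-32LE data as UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM matches."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    encoding, skip = sniff_bom(sample)
    print(encoding, sample[skip:].decode(encoding))
```

Note that this is only a guess based on the first bytes: UTF-16LE text that starts with a BOM followed by U+0000 has the same leading bytes as a UTF-32LE BOM, so the result should be treated as a hint rather than proof.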
According to the Unicode spec, UTF-8, UTF-16, and UTF-32 are the three encoding forms. At the byte level, the spec lists UTF-16, UTF-16LE, and UTF-16BE as separate encoding schemes, and similarly UTF-32, UTF-32LE, and UTF-32BE.
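It's important to know that if the character stream is explicitly labeled as one of the LE or BE forms, the byte order is already known, so a leading FE FF or FF FE sequence is not a BOM. You must interpret it as the character U+FEFF Zero Width No-Break Space, which is part of the text. I.e.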
UTF-16BE initial FE FF is treated as U+FEFF
UTF-16LE initial FF FE is treated as U+FEFF
UTF-32BE initial 00 00 FE FF is treated as U+FEFF
UTF-32LE initial FF FE 00 00 is treated as U+FEFF
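Python's built-in codecs follow this distinction, so they make a convenient way to see it. A small sketch (the variable names are just for illustration): the same bytes are decoded once with the explicit `utf-16-le` codec and once with the generic `utf-16` codec.

```python
# FF FE followed by "A" encoded as UTF-16LE (41 00).
data = b"\xff\xfe" + "A".encode("utf-16-le")

# Explicit byte order: the leading FF FE is decoded as the character
# U+FEFF and stays in the resulting string.
explicit = data.decode("utf-16-le")

# Generic UTF-16: the leading FF FE is consumed as a BOM that selects
# little-endian byte order, and it does not appear in the result.
generic = data.decode("utf-16")

print(repr(explicit))  # '\ufeffA'
print(repr(generic))   # 'A'
```

So if you detect and strip a BOM yourself, decode the remainder with the explicit LE or BE codec; if you leave the BOM in place, use the generic codec.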
See http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G2212 for more details.
Upvotes: 3