Reputation: 50220
It's not a real problem in practice, since I can just write BOM = "\uFEFF"; but it bugs me that I have to hard-code a magic constant for such a basic thing. [Edit: And it's error-prone! I had accidentally written the BOM as \uFFFE in this question, and nobody noticed. It even led to an incorrect proposed solution.] Surely Python defines it in a handy form somewhere?
Searching turned up a series of constants in the codecs module: codecs.BOM, codecs.BOM_UTF8, and so on. But these are bytes objects, not strings. Where is the real BOM?
This is for Python 3, but I would be interested in the Python 2 situation for completeness.
Upvotes: 2
Views: 816
Reputation: 27734
I suppose you could use:
unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
but it's not as clean as what you already have.
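As a quick check of the lookup above, here is a minimal sketch comparing it against the literal and against the bytes constant from codecs. The variable names are only illustrative, and decoding codecs.BOM_UTF8 is one possible way to derive the string, not an official constant:

```python
import codecs
import unicodedata

# U+FEFF looked up by its Unicode character name.
bom_by_name = unicodedata.lookup("ZERO WIDTH NO-BREAK SPACE")

# The same character obtained by decoding the UTF-8 BOM bytes.
# The plain "utf-8" codec does not strip the BOM, so it survives as U+FEFF.
bom_from_codecs = codecs.BOM_UTF8.decode("utf-8")

print(bom_by_name == "\ufeff")          # True
print(bom_from_codecs == bom_by_name)   # True
```

Either expression avoids typing the hex digits by hand, which sidesteps the \uFFFE mix-up mentioned in the question.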
Upvotes: 1
Reputation: 21249
There isn't one. The bytes constants in codecs are what you should be using.
This is because you should never see a BOM in decoded text (i.e., you shouldn't encounter a string that actually contains the code point U+FEFF). Rather, the BOM exists as a byte pattern at the start of a stream, and when you decode some bytes with a BOM using a BOM-aware codec (such as utf-8-sig or utf-16), the U+FEFF isn't included in the output string. Similarly, the encoding process should handle adding any necessary BOM to the output bytes; it shouldn't be in the input string.
The only time a BOM matters is when converting to or from bytes.
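To illustrate letting the codec handle the BOM at the bytes boundary, here is a small sketch using the standard utf-8-sig codec. The sample text and variable names are made up for the example:

```python
import codecs

text = "hello"

# "utf-8-sig" prepends the UTF-8 BOM when encoding...
data = text.encode("utf-8-sig")
starts_with_bom = data.startswith(codecs.BOM_UTF8)

# ...and strips it again when decoding, so the string never contains U+FEFF.
round_trip = data.decode("utf-8-sig")

# The plain "utf-8" codec does not strip the BOM, so U+FEFF leaks into the text.
leaked = data.decode("utf-8")

print(starts_with_bom)            # True
print(round_trip == "hello")      # True
print(leaked == "\ufeffhello")    # True
```

The last case is where a string-form BOM tends to show up in practice: the bytes were decoded with a codec that does not consume the BOM.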
Upvotes: 1