alexis

Reputation: 50220

Unicode Byte Order Mark (BOM) as a Python constant?

It's not a real problem in practice, since I can just write BOM = "\uFEFF", but it bugs me that I have to hard-code a magic constant for such a basic thing. [Edit: And it's error-prone! I had accidentally written the BOM as \uFFFE in this question, and nobody noticed. It even led to an incorrect proposed solution.] Surely Python defines it in a handy form somewhere?

Searching turned up a series of constants in the codecs module: codecs.BOM, codecs.BOM_UTF8, and so on. But these are bytes objects, not strings. Where is the real BOM?
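To illustrate the point about the codecs constants, here is a small sketch that lists a few of them and their types. The selection of names and the variable `bom_constants` are just for illustration; the sketch assumes Python 3.

```python
import codecs

# A handful of the BOM constants exposed by the codecs module.
names = ("BOM", "BOM_UTF8", "BOM_UTF16_LE", "BOM_UTF16_BE",
         "BOM_UTF32_LE", "BOM_UTF32_BE")

# Collect them so the types can be inspected (illustrative variable name).
bom_constants = {name: getattr(codecs, name) for name in names}

for name, value in bom_constants.items():
    # On Python 3 every one of these is a bytes object, not a str.
    print(name, type(value).__name__, value)
```

None of these is a str, which is exactly the gap the question is about.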

This is for Python 3, but I would be interested in the Python 2 situation for completeness.

Upvotes: 2

Views: 816

Answers (2)

Alastair McCormack

Reputation: 27734

I suppose you could use:

unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')

but it's not as clean as what you already have.
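A quick check of that lookup, as a sketch assuming Python 3 (the name `BOM` is just the variable from the question):

```python
import unicodedata

# Look the character up by its Unicode name rather than typing the code point.
BOM = unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')

# The result is a one-character str holding U+FEFF.
assert BOM == '\ufeff'
assert unicodedata.name(BOM) == 'ZERO WIDTH NO-BREAK SPACE'
```

This avoids the \uFFFE / \uFEFF transposition risk, at the cost of a longer spelling.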

Upvotes: 1

一二三

Reputation: 21249

There isn't one. The bytes constants in codecs are what you should be using.

This is because you should normally never see a BOM in decoded text (i.e., you shouldn't encounter a string that actually contains the code point U+FEFF). Rather, the BOM exists as a byte pattern at the start of a stream, and when you decode those bytes with a BOM-aware codec (such as utf-8-sig or utf-16), the U+FEFF isn't included in the output string. Similarly, the encoding process should handle adding any necessary BOM to the output bytes; it shouldn't be in the input string.

The only time a BOM matters is when either converting into or converting from bytes.
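A minimal sketch of that round trip in Python 3, using the standard utf-8-sig and utf-16 codecs. The variable names are illustrative only. Note that the plain utf-8 codec does not strip a leading BOM, so which codec you pick matters:

```python
import codecs

text = "hello"

# utf-8-sig adds the BOM when encoding and removes it when decoding.
data = text.encode("utf-8-sig")
assert data.startswith(codecs.BOM_UTF8)
assert data.decode("utf-8-sig") == text

# Plain utf-8 leaves the BOM in place as a leading U+FEFF character.
leaked = data.decode("utf-8")
assert leaked == "\ufeff" + text

# utf-16 writes a BOM in native byte order and consumes it on decode.
data16 = text.encode("utf-16")
assert data16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert data16.decode("utf-16") == text
```

So if a U+FEFF does turn up at the start of a decoded string, the usual cause is decoding with a codec that does not consume the BOM.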

Upvotes: 1
