Brian M. Hunt

Reputation: 83828

Is unicode( codecs.BOM_UTF8, "utf8" ) necessary in Python 2.7/3?

In a code review I came across the following code:

# Python bug that renders the unicode identifier (0xEF 0xBB 0xBF)
# as a character.
# If untreated, it can prevent the page from validating or rendering 
# properly. 
bom = unicode( codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

This is in a function that passes a string to Response object (Django or Flask).

Is this still a bug that needs this fix in Python 2.7 or 3? Something tells me it isn't, but I thought I'd ask because I don't know this problem very well.

I'm not sure where this came from, but I've seen it around the Internet, referenced sometimes in association with Jinja2 (which we are using).

Thanks for reading.

Upvotes: 6

Views: 2992

Answers (2)

ekhumoro

Reputation: 120698

The Unicode standard states that the character \ufeff has two distinct meanings. At the start of a data stream, it should be used as a byte-order and/or encoding signature, but elsewhere it should be interpreted as a zero-width non-breaking space.

So the code

bom = unicode(codecs.BOM_UTF8, "utf8" )
r = r.replace(bom, '')

isn't just removing the utf-8 encoding signature (aka BOM) - it's also removing any embedded zero-width non-breaking spaces.
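To illustrate the difference, here is a minimal Python 3 sketch comparing a blanket replace with stripping only a leading signature. The helper name `strip_leading_bom` and the sample string are made up for illustration; they are not part of the reviewed code.

```python
import codecs

# In Python 3, str is already Unicode, so decode the BOM bytes explicitly
# instead of calling unicode(). The result is the single character U+FEFF.
bom = codecs.BOM_UTF8.decode("utf-8")

# Hypothetical sample: one U+FEFF at the start and one embedded in the text.
text = "\ufeffhello\ufeffworld"

# What the reviewed code does: removes every U+FEFF, embedded ones included.
replaced_everywhere = text.replace(bom, "")


def strip_leading_bom(s):
    """Remove U+FEFF only when it is the first character (hypothetical helper)."""
    return s[len(bom):] if s.startswith(bom) else s


# Only the leading signature is dropped; the embedded character survives.
leading_only = strip_leading_bom(text)
```

Whether the embedded character should survive depends on the data consumer, so treat this as one option rather than the right answer.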

Some earlier versions of Python did not have a variant of the "utf-8" codec that skips the BOM when reading data streams. Since this was inconsistent with the other Unicode codecs, a "utf-8-sig" codec was introduced in version 2.5, which does skip the BOM.

So it's possible the "Python bug" mentioned in the code comments relates to that.
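As a rough Python 3 sketch of how the two codecs differ when decoding (the sample bytes are made up for illustration):

```python
import codecs

# Hypothetical input: UTF-8 bytes that start with the encoding signature.
data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# The plain "utf-8" codec keeps the signature as a leading U+FEFF character.
with_bom = data.decode("utf-8")

# The "utf-8-sig" codec skips a leading signature if one is present.
without_bom = data.decode("utf-8-sig")

# "utf-8-sig" also accepts input that has no signature at all.
no_bom_input = "héllo".encode("utf-8").decode("utf-8-sig")
```

If the stray character comes from reading files or request bodies, decoding with "utf-8-sig" at that point may make the later `replace` call unnecessary.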

However, it seems more likely that the "bug" relates to embedded \ufeff characters. But since the Unicode Standard clearly states they can be interpreted as legitimate characters, it is really up to the data consumer to decide how to treat them, so this is not a bug in Python.

Upvotes: 7

sorin

Reputation: 170628

BOM is a byte sequence that specifies what Unicode encoding is used.

The BOM tells the decoder how to transform bytes into Unicode text, since the same text can have different binary representations (for example UTF-8, UTF-16 or UTF-32).
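A small Python 3 sketch of that idea, using UTF-16, where the byte order really does vary (the sample payload is made up for illustration):

```python
import codecs

payload = "hi"

# The same text in two byte orders, each prefixed with the matching BOM.
little = codecs.BOM_UTF16_LE + payload.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + payload.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM to pick the byte order,
# then drops it from the decoded text.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")
```

The byte strings differ, but both decode to the same text, with no BOM character left in the result.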

It doesn't make any sense to try to put BOM inside a Unicode string.

Upvotes: 0
