Omprakash

Reputation: 86

Why can't a Unicode string be compared to a byte string in Python?

In the Python docs for patterns, I see that a Unicode string can't be compared with a byte string, but why? You can read the line here: https://github.com/python/cpython/blob/3.5/Lib/re.py

Upvotes: 1

Views: 90

Answers (1)

tripleee

Reputation: 189307

Python 3 introduced a somewhat controversial change where all Python strings are Unicode strings, and all byte strings need to have an encoding specified before they can be converted to Unicode strings.

This goes with the Python principle of "explicit is better than implicit", and removes a large number of potential bugs where implicit conversion would quietly produce wrong or corrupt results when the programmer was careless or unaware of the implications.

The flip side is that it is now hard to write code which mixes Unicode and byte strings unless you properly understand the model. (Well, it was hard before, too; but programmers who were oblivious remained so, and thought their code worked until someone tested it properly. Now they get errors up front.)
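As a rough illustration of those up-front errors, here is a minimal sketch of what Python 3 does when the two types are mixed. The helper `raises_type_error` and the sample value `"café"` are made up for this example; they are not from the `re` source or the docs.

```python
import re

text = "caf\u00e9"            # str: a sequence of Unicode code points ("café")
data = text.encode("utf-8")   # bytes: b'caf\xc3\xa9'

def raises_type_error(func):
    """Hypothetical helper: return True if calling func() raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False

# Concatenating str and bytes is refused rather than guessed at.
concat_fails = raises_type_error(lambda: text + data)

# A str pattern cannot be used to search a bytes object.
regex_fails = raises_type_error(lambda: re.search("caf", data))

# Ordering comparisons between the two types are refused as well.
order_fails = raises_type_error(lambda: text < data)

print(concat_fails, regex_fails, order_fails)
```

In each case Python refuses to pick an encoding on your behalf, which is the "explicit is better than implicit" principle in action.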

Briefly, quoting from the Stack Overflow character-encoding tag info page:

just like changing the font from Arial to Wingdings changes what your text looks like, changing encodings affects the interpretation of a sequence of bytes. For example, depending on the encoding, the bytes 0xE2 0x89 0xA0 could represent the text â‰ (followed by a non-breaking space) in Windows code page 1252, or Б┴═ in KOI8-R, or the character ≠ in UTF-8.

Python 2 would do some unobvious stuff under the hood to coerce this byte string into a native string, which depending on context might involve the local system's "default encoding", and thus produce different results on different systems, creating some pretty hard bugs. Python 3 requires you to explicitly say how the bytes should be interpreted if you want to convert them into a string.

bytestr = b'\xE2\x89\xA0'
fugly = bytestr.decode('cp1252')  # u'â‰\xa0'
cyril = bytestr.decode('koi8-r')  # u'Б┴═'
wtf_8 = bytestr.decode('utf-8')   # u'≠'
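To tie this back to the question about comparing, here is a small sketch of how equality behaves in Python 3. The variable names are just for illustration. A direct `==` between a string and a byte string does not attempt any decoding and is simply false; the comparison only becomes meaningful once both sides are the same type.

```python
bytestr = b'\xE2\x89\xA0'
not_equal_sign = '\u2260'   # the character ≠

# str vs. bytes: never equal, because no encoding is assumed.
raw_equal = (not_equal_sign == bytestr)

# Decode with the encoding the bytes were actually written in: equal.
utf8_equal = (not_equal_sign == bytestr.decode('utf-8'))

# Decode with a different encoding: the same bytes yield different text.
cp1252_equal = (not_equal_sign == bytestr.decode('cp1252'))

# Alternatively, encode the string and compare bytes to bytes.
encoded_equal = (not_equal_sign.encode('utf-8') == bytestr)

print(raw_equal, utf8_equal, cp1252_equal, encoded_equal)
```

Which side you convert is a design choice: decode the bytes if you want to work with text, or encode the string if you need to match raw bytes (for example, when using a bytes pattern with `re`).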

Upvotes: 5
