Reputation: 110093
In reading this tutorial I came across the following difference between the __unicode__ and __str__ methods:
Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.
How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so wouldn't a char = a byte? Or is this referring to (potentially) Unicode characters, which could be multiple bytes? For example, if we took the following:
Ω (omega symbol)
03 A9 or u'\u03a9'
In Python, would this be considered one character (Ω) and two bytes, or two characters (03 A9) and two bytes? Or maybe I am confusing char with character?
Upvotes: 2
Views: 12674
Reputation: 530872
In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved until you choose an encoding.
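As a rough illustration of the point above, the sketch below (Python 3) counts the characters in the string and then the bytes produced by a few encodings. The helper name encoded_length is made up for this example and is not part of any library.

```python
# Python 3 sketch: one character, several possible byte counts.
omega = "\u03a9"  # the single code point U+03A9

def encoded_length(text, encoding):
    """Return how many bytes `text` occupies in the given encoding (hypothetical helper)."""
    return len(text.encode(encoding))

char_count = len(omega)                           # characters (code points)
utf8_bytes = encoded_length(omega, "utf-8")       # bytes in UTF-8
utf16_bytes = encoded_length(omega, "utf-16-le")  # bytes in UTF-16, no BOM
utf32_bytes = encoded_length(omega, "utf-32-le")  # bytes in UTF-32, no BOM

print(char_count, utf8_bytes, utf16_bytes, utf32_bytes)  # 1 2 2 4
```

The character count stays at 1 throughout; only the byte count depends on the encoding chosen.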
One source of ambiguity is a string like 'é', which could either be the single character U+00E9 or the two-character string U+0065 U+0301.
>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
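If you need to compare two such strings, one possible approach (not something the question asked about, so treat it as a side note) is to normalize both with the standard library's unicodedata module. A minimal Python 3 sketch:

```python
import unicodedata

composed = "\u00e9"          # single code point: é
decomposed = "\u0065\u0301"  # 'e' followed by a combining acute accent

# The two strings render identically but are not equal as code point sequences.
equal_raw = composed == decomposed

# NFC composes where possible; NFD decomposes where possible.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(equal_raw, len(nfc), len(nfd))  # False 1 2
```

After normalizing both sides to the same form, the comparison behaves as most people expect.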
The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.
>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'
>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'
In Python 3, that would be (with UTF-8 being the default encoding scheme):
>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'
Other byte sequences can be decoded to U+03A9 as well:
>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
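In the two examples above, the leading b'\xff\xfe' and b'\xff\xfe\x00\x00' are byte order marks (BOMs), not part of the character itself. The Python 3 sketch below, using the constants in the standard codecs module, separates the BOM from the payload; the variable names are only illustrative.

```python
import codecs

omega = "\u03a9"

# Payload bytes for U+03A9 with an explicit byte order and no BOM.
payload_16_le = omega.encode("utf-16-le")  # b'\xa9\x03'
payload_16_be = omega.encode("utf-16-be")  # b'\x03\xa9'
payload_32_le = omega.encode("utf-32-le")  # b'\xa9\x03\x00\x00'

# Prefixing the little-endian BOM lets the generic codecs detect the byte order.
with_bom_16 = codecs.BOM_UTF16_LE + payload_16_le
with_bom_32 = codecs.BOM_UTF32_LE + payload_32_le

decoded_16 = with_bom_16.decode("utf-16")
decoded_32 = with_bom_32.decode("utf-32")
decoded_be = payload_16_be.decode("utf-16-be")

print(decoded_16, decoded_32, decoded_be)  # Ω Ω Ω
```

Every one of these byte sequences decodes to the same one-character string, which is why the character and its bytes have to be treated as separate things.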
Upvotes: 3