Char and bytes in Python

Question

While reading this tutorial, I came across the following difference between the __unicode__ and __str__ methods:

Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.

How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so wouldn't a char equal a byte? Or does this refer to Unicode characters, which could potentially span multiple bytes? For example, take the following:

Ω (omega symbol)
03 A9 or u'\u03a9'

In Python, would this be considered one character (Ω) and two bytes, or two characters (03 A9) and two bytes? Or am I perhaps confusing char and character?

chepner · Accepted Answer

In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved.

One source of ambiguity is a string like 'é', which could be either the single character U+00E9 or the two-character string U+0065 U+0301 (the letter e followed by a combining acute accent).

>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
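
As a rough Python 3 sketch of the same ambiguity, the standard library's unicodedata module can convert between the two spellings of 'é'. The variable names below are only illustrative and are not part of the original answer.

```python
import unicodedata

precomposed = "\u00e9"       # single code point U+00E9
decomposed = "\u0065\u0301"  # 'e' followed by combining acute accent U+0301

# NFC composes where possible; NFD decomposes where possible.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(precomposed == decomposed)  # False: different code point sequences
print(nfc == precomposed)         # True: equal once both are in NFC
print(len(nfc), len(nfd))         # 1 2
```

Normalizing both strings to the same form before comparing them is one way to treat the two spellings as equal.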

The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.

>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'

>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'

In Python 3, that would be the following (UTF-8 is the default encoding for str.encode and bytes.decode):

>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'

Other byte sequences can be decoded to U+03A9 as well:

>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
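
To tie this back to the original question, here is a minimal Python 3 sketch showing that the character count of Ω stays at one while the byte count depends entirely on the chosen encoding. The helper byte_length is a hypothetical name introduced only for this illustration.

```python
omega = "\u03a9"  # the single character Ω


def byte_length(text, encoding):
    """Return how many bytes `text` occupies once encoded with `encoding`."""
    return len(text.encode(encoding))


print(len(omega))                       # 1 character, whatever the encoding
print(byte_length(omega, "utf-8"))      # 2
print(byte_length(omega, "utf-16-le"))  # 2 (no byte order mark)
print(byte_length(omega, "utf-32-le"))  # 4 (no byte order mark)
print(byte_length(omega, "utf-16"))     # 4 (2-byte BOM + 2 bytes)
print(byte_length(omega, "utf-32"))     # 8 (4-byte BOM + 4 bytes)
```

So "one character" is a property of the string, while "how many bytes" only has an answer once an encoding has been picked.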
