What happens when you call str() on a unicode string?

Question

I'm wondering what happens internally when you call str() on a unicode string.

# coding: utf-8
s2 = str(u'hello')

Is s2 just the byte representation of the unicode argument passed to str()?

icktoofay · Accepted Answer

In Python 2, str() will try to encode the unicode string with your default encoding. On my system, that's ASCII, so if there are any non-ASCII characters, it will fail:

>>> str(u'あ')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)

Note that this is the same error you'd get if you called encode('ascii') on it:

>>> u'あ'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
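To make the implicit step concrete, here is a minimal sketch written for Python 3, where the encoding has to be spelled out. The helper name encode_like_py2_str is hypothetical, and the ASCII default is an assumption that matches the system described above; it is only an approximation of what Python 2's str() does to a unicode argument.

```python
def encode_like_py2_str(text, encoding="ascii"):
    """Hypothetical helper: roughly what Python 2's str(unicode_obj) does,
    assuming the default encoding is ASCII."""
    return text.encode(encoding)


# Pure-ASCII text encodes without trouble.
ascii_result = encode_like_py2_str("hello")

# Non-ASCII text fails in the same way as str(u'あ') in Python 2.
try:
    encode_like_py2_str("あ")
    failed = False
    error_codec = None
except UnicodeEncodeError as exc:
    failed = True
    error_codec = exc.encoding  # name of the codec that rejected the character
```

The point of the sketch is that the failure depends on the data, not on the call: the same function succeeds for 'hello' and raises for 'あ'.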

As you might imagine, str working on some arguments and failing on others makes it easy to write code that seems to work at first glance but breaks once you throw some international characters in there. Python 3 avoids this by making the problem blatantly obvious: you can't convert Unicode text to a byte string without an explicit encoding:

>>> bytes(u'あ')
TypeError: string argument without an encoding
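For comparison, a short Python 3 sketch of the explicit version. UTF-8 is chosen here only as an example encoding; nothing in the question implies it is the right one for your data.

```python
text = "あ"

# Omitting the encoding raises TypeError instead of guessing a default.
try:
    bytes(text)
    raised = False
except TypeError:
    raised = True

# Naming the encoding works; bytes(text, enc) is equivalent to text.encode(enc).
utf8_bytes = bytes(text, "utf-8")
same_as_encode = utf8_bytes == text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")
```

Because the encoding is always named, the code behaves the same for ASCII and non-ASCII input.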
