Reputation: 4753
Below are a few cases I observed. I'd like to know why Python's print behaves like this, and what the possible fixes are.
>>> print "%s" % u"abc" # works
>>> print "%s" % "\xd1\x81" # works
>>> print "%s %s" % (u"abc", "\xd1\x81") # Error
For the last one, I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)
But this works:
>>> print "%s %s" % ("abc", "\xd17\x81") # works
And when I do
>>> print "%s %s" % (u"abc", u"\u0441") # Error
It raises UnicodeEncodeError: 'charmap' codec can't encode character u'\u0441' in position 4: character maps to <undefined>
Upvotes: 0
Views: 1244
Reputation: 178115
When you mix Unicode strings and byte strings in Python 2, the byte strings are implicitly coerced to Unicode using the default ascii codec. You will get UnicodeDecodeError if this fails.
When you print Unicode strings, they are implicitly encoded in the current output encoding. You will get UnicodeEncodeError if this fails.
So:
>>> print "%s" % u"abc"
is really:
>>> print unicode("%s",'ascii') % u"abc" # and valid
But the following only "works" in the sense that it doesn't throw an error. If you expect it to print the U+0441 character, it will do so only if the output encoding is UTF-8. It prints garbage on my Windows system.
>>> print "%s" % "\xd1\x81"
The following gives error because of the implicit Unicode decoding:
>>> print "%s %s" % (u"abc", "\xd1\x81")
which is really:
>>> print unicode("%s %s",'ascii') % (u"abc", unicode("\xd1\x81",'ascii'))
\xd1 and \x81 are outside the ASCII range of 0-7Fh.
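One possible fix for the mixed case is to decode the byte string yourself, with the codec it was actually written in, before formatting. The sketch below assumes the bytes are UTF-8 (which is what \xd1\x81 looks like, since that is U+0441 in UTF-8). It uses b"" and u"" prefixes so that the two types stay explicit and the same code also runs on Python 3. The variable names are only illustrative.

```python
raw = b"\xd1\x81"  # assumed to be UTF-8 bytes for U+0441

# This is what the implicit coercion attempts: decode as ASCII.
# It fails at the first byte, 0xD1, which is above 0x7F.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly with the right codec gives a Unicode string,
# so the format operation no longer mixes the two types.
text = raw.decode("utf-8")
line = u"%s %s" % (u"abc", text)
```

After this, line is a Unicode string, and whether it can be printed depends only on the output encoding.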
The last error implies that your output encoding is not UTF-8, because it couldn't encode \u0441 to a character supported by the output encoding for printing. UTF-8 can encode all Unicode characters.
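The encode step that print performs can be reproduced by hand. The sketch below assumes cp1252 as the console encoding. That is a guess: the 'charmap' in the error message only says that some table-based code page is in use, not which one. It runs on Python 2 and Python 3.

```python
text = u"abc \u0441"

# Encoding to a code page that has no Cyrillic letters fails,
# which is what print runs into on such a console.
try:
    text.encode("cp1252")  # cp1252 is an assumed console encoding
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the offending character

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode("utf-8")
```

Here failed_at is the same position that the error message in the question reports.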
Upvotes: 2
Reputation: 515
This is correct. When you output, you have to encode your unicode object to the desired character encoding, e.g. utf-8. Think of unicode (including all u"" literals) as an abstraction that has to be encoded to something like utf-8 prior to serialisation.
You can encode a unicode object s to utf-8 with s.encode('utf-8'). str objects in Python 2 are byte strings, so you do not get an error with things like "\xd17\x81": they are already encoded.
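If the output encoding cannot be changed, one workaround is to encode explicitly and let unencodable characters be replaced rather than raise. A minimal sketch, where encode_for_output is a hypothetical helper name and cp1252 is only an assumed example of a limited console encoding:

```python
def encode_for_output(text, encoding):
    """Encode a Unicode string for output, replacing any character
    the target encoding cannot represent with '?'."""
    return text.encode(encoding, "replace")

lossy = encode_for_output(u"abc \u0441", "cp1252")  # assumed code page
exact = encode_for_output(u"abc \u0441", "utf-8")
```

The trade-off is that the replaced characters are lost, so this suits display or logging better than data that has to round-trip.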
I would recommend using Python 3 rather than Python 2; this is a bit more intuitive there.
Upvotes: 0