Surya Kasturi

Reputation: 4753

Python print failing to print Unicode and byte strings at the same time

Below are a few cases I observed. I'd like to know why Python's print behaves this way, and what the possible fixes are.

>>> print "%s" % u"abc" # works
>>> print "%s" % "\xd1\x81" # works
>>> print "%s %s" % (u"abc", "\xd1\x81") # Error

For the last one, I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)

But this works:

>>> print "%s %s" % ("abc", "\xd17\x81") # works

And when I do

>>> print "%s %s" % (u"abc", u"\u0441") # Error

it raises UnicodeEncodeError: 'charmap' codec can't encode character u'\u0441' in position 4: character maps to <undefined>

Upvotes: 0

Views: 1244

Answers (2)

Mark Tolonen

Reputation: 178115

When you mix Unicode strings and byte strings in Python 2, the byte strings are implicitly coerced to Unicode using the default ascii codec. You will get UnicodeDecodeError if this fails.

When you print Unicode strings, they are implicitly encoded in the current output encoding. You will get UnicodeEncodeError if this fails.

So:

>>> print "%s" % u"abc"

is really:

>>> print unicode("%s",'ascii') % u"abc" # and valid

But the following only "works" in the sense that it doesn't raise an error. It prints the U+0441 character only if the output encoding is UTF-8; on my Windows system it prints garbage.

>>> print "%s" % "\xd1\x81"

The following raises an error because of the implicit Unicode decoding:

>>> print "%s %s" % (u"abc", "\xd1\x81")

which is really:

>>> print unicode("%s %s",'ascii') % (u"abc", unicode("\xd1\x81",'ascii'))

Both bytes, \xd1 and \x81, are outside the ASCII range of 0x00-0x7F.
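A rough sketch of doing that decoding explicitly instead of leaving it to the implicit ascii coercion. It assumes the two bytes were meant as UTF-8 (the question does not say so), and the helper name format_pair is hypothetical. It uses b"" and u"" literals so that it behaves the same on Python 2.6+ and Python 3.3+:

```python
raw = b"\xd1\x81"  # two bytes; assumed here to be the UTF-8 form of U+0441

# Reproduce what Python 2 attempts implicitly when a byte string
# is mixed with a Unicode string: decode the bytes as ASCII.
try:
    raw.decode("ascii")
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False  # 0xd1 is outside the ASCII range


def format_pair(text, data, encoding="utf-8"):
    """Hypothetical helper: decode the bytes first, then format as Unicode."""
    return u"%s %s" % (text, data.decode(encoding))


result = format_pair(u"abc", raw)  # a Unicode string, no implicit decoding
```

The point is only that every value going into the format is already Unicode, so nothing is left for the ascii codec to guess at.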

The last error implies that your output encoding is not UTF-8, because it couldn't encode \u0441 to a character supported by the output encoding for printing. UTF-8 can encode all Unicode characters.
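To see this without depending on a particular terminal, here is a small sketch that checks whether a string can be encoded in a given output encoding. The function name encodable is hypothetical, and cp1252 is only an assumed stand-in for a Windows "charmap" code page; the question does not say which one was in use.

```python
import sys


def encodable(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


text = u"abc \u0441"

# sys.stdout.encoding may be None when output is piped;
# falling back to "ascii" here is an assumption of this sketch.
out_enc = getattr(sys.stdout, "encoding", None) or "ascii"

cp1252_ok = encodable(text, "cp1252")  # cp1252 has no Cyrillic letters
utf8_ok = encodable(text, "utf-8")     # UTF-8 covers every Unicode character

# If the target encoding cannot be changed, an error handler such as
# "replace" substitutes "?" instead of raising.
safe = text.encode("cp1252", "replace")
```

Calling encodable(text, out_enc) tells you whether a plain print of that string would raise UnicodeEncodeError on the current output.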

Upvotes: 2

proycon

Reputation: 515

This is correct. When you output, you have to encode your unicode object to the desired character encoding, e.g. UTF-8. Think of unicode (including all u"" literals) as an abstraction that has to be encoded to something like UTF-8 prior to serialisation.

You can encode a unicode object s to UTF-8 with s.encode('utf-8'). str objects in Python 2 are byte strings, so things like "\xd17\x81" do not raise an error: they are already encoded.
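A minimal sketch of that, keeping everything as unicode until the very end and encoding once at the output boundary. The variable names are only illustrative, and the literals are chosen so the snippet runs the same on Python 2.6+ and 3.3+:

```python
s = u"\u0441"

encoded = s.encode("utf-8")        # bytes: the serialised form
decoded = encoded.decode("utf-8")  # back to the abstract Unicode string

# Format with Unicode operands only, then encode the whole line once.
line = (u"%s %s" % (u"abc", s)).encode("utf-8")
```

Here line is a byte string that can be written to a file or a pipe without any implicit conversion taking place.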

I would recommend using Python 3 rather than Python 2; this is a bit more intuitive there.

Upvotes: 0
