Reputation: 13

Inconsistent output of unicode strings with print and format()

I read a value from a database query that produces a unicode string. For reasons that are irrelevant here, the data-entry person entered the string value into the database as: "Assessor’s Parcel" (note the 'backward' apostrophe). I'm writing code that's just going through selected database records and printing out text. I use the .format() operation to insert the text from the variable into the printout. As we all know, .format fails when handed a unicode string. So, to reduce this to the conundrum, I submit the following example:

>>> a = u"Assessor’s Parcel"
>>> a
u'Assessor\u2019s Parcel'
>>> print a
Assessor’s Parcel
>>> "{0}".format(a)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
>>>

The above lines are from the 'Interactive Window' of PythonWin (PythonWin 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32.)

Why does 'print a' produce different output than just 'a'? And if either of those can produce reasonable output, why can't .format()?

If I determine that I can't output unicode text (for some as yet unknown reason) and that I would be content with output that contains the "\u" syntax, then do I really have to wrap all my string outputs from database values in some code (method or otherwise) that does the conversion?

Upvotes: 1

Answers (4)

mertyildiran

Reputation: 6613

Here are a few of my attempts to print it properly. print a.encode('utf-8') seems to be the solution:

>>> a = u"Assessor’s Parcel"
>>> a
u'Assessor\u2019s Parcel'

>>> print a
Assessor’s Parcel

>>> "{0}".format(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)

>>> a.encode('utf-8')
'Assessor\xe2\x80\x99s Parcel'

>>> print a.encode('utf-8')
Assessor’s Parcel

Upvotes: 0

Chad S.

Reputation: 6631

Just use unicode! (notice that your error is the first example on that HOWTO)

The issue isn't with format itself. You're putting a unicode object into a bytestring template, so Python tries to encode it using the default encoding, which is ASCII. If you format it into a unicode template instead, there is no problem.

>>> a = u"Assessor’s Parcel"
>>> '{}'.format(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
>>> u'{}'.format(a)
u'Assessor\u2019s Parcel'
>>> print u'{}'.format(a)
Assessor’s Parcel
>>>

It would also not be an issue if you were putting a bytestring into a bytestring.

>>> '{}'.format(a.encode('utf8'))
'Assessor\xe2\x80\x99s Parcel'
>>> print '{}'.format(a.encode('utf8'))
Assessor’s Parcel
>>>

But that makes it more difficult to output to another (different) encoding later.
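One way to keep that flexibility is to build the text as unicode and encode only at the point of output. The following is a minimal sketch of that idea, not a definitive recipe; the helper name render_label and the "Label: " prefix are made up for illustration, and the code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
# Hypothetical helper: builds the line as text (unicode), no encoding yet.
def render_label(value):
    # The template is a unicode literal, so no implicit ASCII encode happens.
    return u"Label: {0}".format(value)

a = u"Assessor\u2019s Parcel"   # same value as in the question
line = render_label(a)          # still text, safe to reuse or reformat

# Encode only at the output boundary, once the target encoding is known.
utf8_bytes = line.encode("utf-8")     # e.g. for a UTF-8 file or terminal
cp1252_bytes = line.encode("cp1252")  # e.g. for a legacy Windows consumer
```

Because line stays unicode until the last step, the same value can be written to destinations with different encodings without decoding it again.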

Upvotes: 2

MacFreek

Reputation: 3446

In the interactive shell, typing a prints the representation of a. You can get the same output with print repr(a).

print a writes the string itself to stdout rather than its representation. Because a is a unicode object, print encodes it using whatever the encoding of the output stream is. So print a is similar to sys.stdout.write(a.encode(sys.stdout.encoding) + "\n")
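The equivalence above can be sketched as a small helper. This is only an approximation of what print does, under stated assumptions: write_line and its fallback parameter are hypothetical names, and the stream is assumed to accept bytes. The fallback is there because on Python 2 sys.stdout.encoding can be None when output is redirected to a file or pipe.

```python
import io

# Hypothetical helper approximating what print does with a unicode object.
def write_line(text, stream, encoding=None, fallback="utf-8"):
    # Prefer an explicit encoding, then the stream's own, then the fallback.
    enc = encoding or getattr(stream, "encoding", None) or fallback
    data = (text + u"\n").encode(enc)  # raises UnicodeEncodeError if enc cannot represent text
    stream.write(data)
    return data

a = u"Assessor\u2019s Parcel"
buf = io.BytesIO()      # stands in for a byte-oriented output stream
write_line(a, buf)      # BytesIO has no .encoding, so the UTF-8 fallback is used
```

Passing encoding="ascii" with the same value raises the UnicodeEncodeError from the question, which shows that the error comes from the choice of encoding and not from the string.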

Please note the difference between u"string" and "string". The first is a Unicode string, a sequence of Unicode code points, while the latter is a binary string, a sequence of bytes. Python 3 makes a much more rigid distinction between the two (I actually prefer Python 3 since it is more picky, and thus better at telling me what I did wrong).

In "{0}".format(a), "{0}" is a binary string. You try to format a unicode string with non-ASCII characters into that binary string. That fails because Python has to convert the Unicode string to bytes and, without being told which encoding to use, falls back to ASCII. So you can do: "{0}".format(a.encode('utf-8')).

However, you may not want a formatted binary string, but instead a formatted Unicode string. In that case, you can write: u"{0}".format(a)

Upvotes: 0

Prune

Reputation: 77850

Typing just 'a' asks for the "most raw" form of the value, which comes from the class's __repr__ method. print instead produces the readable string form, encoded for the console. The format expression sends it through yet another conversion, one that is currently working in ASCII.
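The ASCII conversion can be made explicit, which is roughly what the bytestring template does implicitly on Python 2. The sketch below also shows one option for the asker's fallback of accepting "\u" escapes in the output: the standard backslashreplace error handler. It is an illustration under those assumptions and should run on both Python 2.7 and Python 3.

```python
a = u"Assessor\u2019s Parcel"

# Explicit version of the implicit conversion: strict ASCII cannot
# represent U+2019, so this raises the error seen in the question.
try:
    a.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Fallback: keep the output pure ASCII by escaping what does not fit.
# The result contains the literal six characters \u2019.
escaped = a.encode("ascii", "backslashreplace")

# Lossless alternative: choose an encoding that can represent the character.
utf8 = a.encode("utf-8")
```

With this approach only the final output step needs the encode call; the database values themselves can stay unicode everywhere else.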

Upvotes: 0
