Reputation: 13
I read a value from a database query that produces a unicode string. For reasons that are irrelevant here, the data-entry person entered the string value into the database as: "Assessor’s Parcel" (note the 'backward' apostrophe). I'm writing code that's just going through selected database records and printing out text. I use the .format() operation to insert the text from the variable into the printout. As we all know, .format fails when handed a unicode string. So, to reduce this to the conundrum, I submit the following example:
>>> a = u"Assessor’s Parcel"
>>> a
u'Assessor\u2019s Parcel'
>>> print a
Assessor’s Parcel
>>> "{0}".format(a)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
>>>
The above lines are from the 'Interactive Window' of PythonWin (PythonWin 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32.)
Why does 'print a' produce different output than just 'a'? And if either of those can produce reasonable output, why can't .format()?
If I determine that I can't output unicode text (for some as yet unknown reason) and that I would be content with output that contains the "\u" syntax, then do I really have to wrap all my string outputs from database values in some code (method or otherwise) that does the conversion?
Upvotes: 1
Views: 533
Reputation: 6613
Here are a few of my attempts to print it properly. print a.encode('utf-8') seems to be the solution:
>>> a = u"Assessor’s Parcel"
>>> a
u'Assessor\u2019s Parcel'
>>> print a
Assessor’s Parcel
>>> "{0}".format(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
>>> a.encode('utf-8')
'Assessor\xe2\x80\x99s Parcel'
>>> print a.encode('utf-8')
Assessor’s Parcel
Upvotes: 0
Reputation: 6631
Just use unicode! (notice that your error is the first example on that HOWTO)
The issue isn't with format itself. You're trying to put a unicode object into a bytestring, so Python tries to encode it using the default encoding, which is ascii. If you instead format it into a unicode literal, there is no problem.
>>> a = u"Assessor’s Parcel"
>>> '{}'.format(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 8: ordinal not in range(128)
>>> u'{}'.format(a)
u'Assessor\u2019s Parcel'
>>> print u'{}'.format(a)
Assessor’s Parcel
>>>
It would also not be an issue if you were putting a bytestring into a bytestring.
>>> '{}'.format(a.encode('utf8'))
'Assessor\xe2\x80\x99s Parcel'
>>> print '{}'.format(a.encode('utf8'))
Assessor’s Parcel
>>>
But that makes it more difficult to output to another (different) encoding later.
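A minimal sketch of that approach, assuming the database values arrive as unicode objects: keep both the template and the values as unicode while formatting, and encode once at the point of output. The names format_record and rows are hypothetical, not from the question. The u'' literals are accepted by Python 2.7 and by Python 3.3+.

```python
# Hypothetical helper: format in unicode, encode only at the output boundary.

def format_record(label, value):
    """Return a unicode string; the template and both arguments are unicode."""
    return u"{0}: {1}".format(label, value)

# Stand-in for values read from the database (assumed to be unicode already).
rows = [u"Assessor\u2019s Parcel", u"Plain ASCII value"]

lines = [format_record(u"Name", value) for value in rows]

# Encode once, at the edge, to whatever encoding the destination needs.
utf8_bytes = u"\n".join(lines).encode("utf-8")
```

Because the formatted text stays unicode until the last step, switching the destination encoding later only means changing the final encode() call.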
Upvotes: 2
Reputation: 3446
In the interactive shell, typing a on its own prints the representation of a. You can achieve the same with print repr(a). print a writes the text of a to stdout. print always encodes unicode output to whatever the encoding of the output stream is, so print a is roughly similar to sys.stdout.write(a.encode(sys.stdout.encoding) + "\n")
Please note the difference between u"string" and "string". The first is a Unicode string - a sequence of Unicode code points, while the latter is a binary string - a sequence of bytes. Python 3 makes a much more rigid distinction between the two (I actually prefer Python 3 since it is more picky, and thus better at telling me what I did wrong).
In "{0}".format(a), "{0}" is a binary string. You are trying to format a unicode string containing non-ASCII characters into that binary string. That fails because you need to tell Python how to convert from Unicode to a binary string; without that, it falls back to the ASCII codec. So you can do: "{0}".format(a.encode('utf-8')).
However, you may not want a formatted binary string, but instead a formatted Unicode string. In that case, you can write: u"{0}".format(a)
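If output containing the "\u" syntax is acceptable, as the question suggests, one option is to encode to ASCII with the backslashreplace error handler. This is a sketch rather than a recommendation; the helper name to_ascii_escaped is hypothetical.

```python
def to_ascii_escaped(text):
    """Encode unicode text to ASCII bytes, writing non-ASCII characters as \\uXXXX escapes."""
    return text.encode("ascii", "backslashreplace")

# \u2019 is the right single quotation mark from the question's example.
escaped = to_ascii_escaped(u"Assessor\u2019s Parcel")
```

On Python 2.7 the result is a byte string, so it can be passed to a byte-string template such as "{0}".format(escaped) without triggering the implicit ASCII encode.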
Upvotes: 0
Reputation: 77850
Typing just 'a' asks for the "most raw" form of the value, which comes from the repr method of the class. print instead converts the value to its printable text form (for a unicode object, by encoding it for the output stream). The format expression sends it through yet another conversion, one that, with a byte-string template, works in ASCII.
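The three separate conversion paths can be made visible with a small probe class. This is only an illustration for an ordinary object; Probe is a hypothetical name, and a unicode object on Python 2.7 is a special case in that print encodes it for the output stream.

```python
class Probe(object):
    """Hypothetical class that reports which conversion hook was called."""

    def __repr__(self):
        return "via __repr__"

    def __str__(self):
        return "via __str__"

    def __format__(self, format_spec):
        return "via __format__"

p = Probe()
shown_at_prompt = repr(p)          # what typing p at the >>> prompt displays
shown_by_print = str(p)            # what print writes for an ordinary object
shown_by_format = "{0}".format(p)  # what str.format inserts into the template
```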
Upvotes: 0