Reputation: 4898
Why does Python add \xe3 in the output of:
>>> b'Transa\xc3\xa7\xc3\xa3o'.decode('utf-8')
'Transaç\xe3o'
Expected value is:
'Transação'
Some more information about my environment:
>>> import sys
>>> print (sys.version)
3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)]
>>> sys.stdout.encoding
'cp437'
This was under Console 2 + PowerShell.
Upvotes: 1
Views: 4015
Reputation: 1123400
You need to use a console or terminal that supports all of the characters that you want to print.
When you echo a value in the interactive console, its repr is encoded with your console's codec, and any character that codec does not support is written using the backslashreplace error handler, which keeps the output readable rather than throwing an exception. This is a feature of the default sys.displayhook() function:
If repr(value) is not encodable to sys.stdout.encoding with sys.stdout.errors error handler (which is probably 'strict'), encode it to sys.stdout.encoding with 'backslashreplace' error handler.
Your console can handle ç but not ã. Several codecs include the first character but not the second; you are using IBM code page 437, but it is by no means the only one.
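To check which characters a given codec supports, you can simply try to encode them. The sketch below does that for the two characters in question; the helper can_encode and the list of candidate codecs are arbitrary choices for illustration, not an exhaustive survey.

```python
def can_encode(char, encoding):
    """Return True if `char` can be encoded with `encoding` (strict mode)."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


C_CEDILLA = '\u00e7'  # the c-cedilla character from the question
A_TILDE = '\u00e3'    # the a-tilde character from the question

# A few codecs picked for illustration only.
candidates = ['cp437', 'cp850', 'cp1252', 'utf-8']

# Maps codec name -> (supports c-cedilla, supports a-tilde)
support = {
    enc: (can_encode(C_CEDILLA, enc), can_encode(A_TILDE, enc))
    for enc in candidates
}
```

For cp437 the result is (True, False), which is exactly the situation in the question; the other three codecs listed here can encode both characters.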
If you are running Python in the standard Windows console (cmd.exe), be aware that Python, Unicode and that console do not mix very well. You can install the win-unicode-console package to make Python 3 use the Windows APIs to better output Unicode text; you will still need a font capable of displaying your Unicode text.
I don't know for certain if that package is compatible with other Windows shells; your mileage may vary.
Upvotes: 5