Reputation: 151
The following code runs fine with Python3 on my Windows machine and prints the character 'é':
data = b"\xc3\xa9"
print(data.decode('utf-8'))
However, running the same on an Ubuntu based docker container results in :
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
Is there anything that I have to install to enable utf-8 decoding ?
Upvotes: 4
Views: 7892
Reputation: 6044
Seems ubuntu - depending on version - uses one encoding or another as default, and it may vary between shell and python as well. Adopted from this posting and also this blog:
Thus the recommended way seems to be to tell your python instance to use utf-8
as default encoding:
Set your default encoding of python source files via environment variable:
export PYTHONIOENCODING=utf8
Also, in your source files you can state the encoding you prefer to be used explicitly, so it should work irrespective of environment setting (see this question + answer, python docs and PEP 263:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....
Concerning the interpretation of encoding of files read by python, you can specify it explicitly in the open command
with open(fname, "rt", encoding="utf-8") as f:
...
and there's a more hackish way with some side effects, but saves you to explicitly specify it each time
import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8')
Please read the warnings about this hack in the related answer and comments.
Upvotes: 6
Reputation: 5817
The problem is with the print()
expression, not with the decode()
method.
If you look closely, the raised exception is a UnicodeEncodeError, not a -DecodeError.
Whenever you use the print()
function, Python converts its arguments to a str
and subsequently encodes the result to bytes
, which are sent to the terminal (or whatever Python is run in).
The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment.
In an ideal case,
In your case, the second condition isn't met for the Linux docker you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter. These are a few options to address this problem:
LC_ALL
to something containing "UTF-8" worked for me once. You'll have to put them in start-up script for the shell your terminal runs, eg. .bashrc.Re-encode STDOUT, like so:
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
The encoding used has to match the one of the terminal.
sys.stdout
, eg. sys.stdout.buffer.write("é".encode('utf8'))
. This is of course much more boilerplate than print("é")
. Again, the encoding used has to match the one of the terminal.print()
altogether. Use open(fn, encoding=...)
for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).There might be other options, but I doubt that there are nicer ones.
Upvotes: 4