Reputation: 4228

Python UnicodeDecodeError when writing German letters

I've been banging my head on this error for some time now and I can't seem to find a solution anywhere on SO, even though there are similar questions.

Here's my code:

f = codecs.open(path, "a", encoding="utf-8")
value = "Bitte überprüfen"
f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

And what I get as an error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

Why ascii if I say utf-8? I would really appreciate any help.

Upvotes: 9

Answers (4)

user937284

Reputation: 2634

As already suggested, your error results from this line:

f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

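it should be the following, with value declared as a unicode string (value = u"Bitte überprüfen"), because a file opened with codecs.open(..., encoding="utf-8") expects unicode and does the encoding itself:

f.write(u'"{}" = "{}";\n'.format(u'no_internet', value))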

A note on unicode and encodings

When working with Python 2, software should only work with unicode strings internally, converting to a particular encoding on output.

To avoid making the same error over and over again, make sure you understand the difference between the ascii and utf-8 encodings, and also between str and unicode objects in Python.

The difference between ASCII and UTF-8 encoding:

ASCII needs just one byte to represent every character in its charset. UTF-8 needs between one and four bytes per character to represent the complete Unicode charset.

ascii (default)
1    If the code point is < 128, each byte is the same as the value of the code point.
2    If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

utf-8 (unicode transformation format)
1    If the code point is <128, it’s represented by the corresponding byte value.
2    If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3    Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
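
The byte counts in the two lists above can be checked directly. What follows is a minimal sketch, assuming Python 3 (where str is the Unicode type); the sample characters and the helper name utf8_length are illustrative choices, not part of the original answer.

```python
# Sketch (Python 3 assumed): how many bytes UTF-8 needs per code point.
# The sample characters are illustrative: A, u-umlaut, euro sign, an emoji.
samples = [u"A", u"\u00fc", u"\u20ac", u"\U0001F600"]


def utf8_length(char):
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(char.encode("utf-8"))


for char in samples:
    print("U+%04X -> %d byte(s)" % (ord(char), utf8_length(char)))

# ASCII cannot represent code points >= 128 at all.
try:
    u"\u00fc".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The ü from the question falls in the two-byte range, which is why it shows up as the pair 0xc3 0xbc in UTF-8 encoded data.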

The difference between str and unicode objects:

You can say that str is basically a byte string and unicode is a string of Unicode code points. A str holds bytes in some encoding such as ascii or utf-8; a unicode object has no encoding until you encode it.

str vs. unicode
1   str     = byte string (8-bit) - uses \x and two digits
2   unicode = unicode string      - uses \u and four digits

Both derive from basestring:

    basestring
        /\
       /  \
    str    unicode

If you follow some simple rules, you should be fine handling str/unicode objects in different encodings like ascii or utf-8 or whatever encoding you have to use:

Rules
1    encode(): Gets you from Unicode -> bytes
     encode([encoding], [errors='strict']) returns an 8-bit string version of the Unicode string.
2    decode(): Gets you from bytes -> Unicode
     decode([encoding], [errors]) interprets the 8-bit string using the given encoding.
3    codecs.open(encoding="utf-8"): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4    u"": Makes your string literals into Unicode objects rather than byte sequences.
5    unicode(string[, encoding, errors]): Builds a Unicode object from a byte string by decoding it with the given encoding.

Warning: Don’t use encode() on bytes or decode() on Unicode objects

And again: Software should only work with Unicode strings internally, converting to a particular encoding on output.
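
The rules above can be sketched as follows. This is a minimal example under stated assumptions: it targets Python 3 (and should also run on Python 2.7), it uses io.open as a stand-in for codecs.open, and the helper names write_entry and read_raw as well as the temporary file name are hypothetical.

```python
import io
import os
import tempfile


def write_entry(path, key, value):
    """Append one '"key" = "value";' line; the file object does the encoding."""
    # newline="\n" keeps the line ending identical on every platform.
    with io.open(path, "a", encoding="utf-8", newline="\n") as f:
        f.write(u'"{}" = "{}";\n'.format(key, value))


def read_raw(path):
    """Return the file content as undecoded bytes."""
    with io.open(path, "rb") as f:
        return f.read()


# Hypothetical output file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")

# Unicode text goes in; no manual .encode() call is needed.
write_entry(path, u"no_internet", u"Bitte \u00fcberpr\u00fcfen")

raw = read_raw(path)         # bytes at the boundary
text = raw.decode("utf-8")   # back to Unicode for internal use
```

The design choice here is to encode in exactly one place, the file object, so no value is ever encoded twice or decoded implicitly.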

Upvotes: 1

zmo

Reputation: 24802

For the sake of never being hurt by unicode errors ever again, switch to python3:

% python3
>>> with open('/tmp/foo', 'w') as f:
...     value = "Bitte überprüfen"
...     f.write(('"{}" = "{}";\n'.format('no_internet', value)))
... 
36
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo
"no_internet" = "Bitte überprüfen";

though if you're really tied to python2 and have no choice:

% python2
>>> with open('/tmp/foo2', 'w') as f:
...   value = u"Bitte überprüfen"
...   f.write(('"{}" = "{}";\n'.format('no_internet', value.encode('utf-8'))))
... 
>>> import sys
>>> sys.exit(0)
% cat /tmp/foo2
"no_internet" = "Bitte überprüfen";

And as @JuniorCompressor suggests, don't forget to add # encoding: utf-8 at the start of your python2 file to tell python to read the source file as UTF-8, not as ASCII!

Your error in:

f.write(("\"%s\" = \"%s\";\n" % ("no_internet", value)).encode("utf-8"))

is that you're encoding the whole formatted string into utf-8, whereas you should encode the value string into utf-8 before doing the format. Encoding after the format fails:

>>> with open('/tmp/foo2', 'w') as f:
...    value = u"Bitte überprüfen"
...    f.write(('"{}" = "{}";\n'.format('no_internet', value).encode('utf-8')))
... 
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 6: ordinal not in range(128)

This fails because formatting a unicode value into a byte-string template makes python implicitly encode it with the ascii codec, which cannot represent ü. So keep the text in the unicode type (which is what u"" does), and explicitly encode that value to utf-8 before feeding it to the format call that builds the new string.

As Karl says in his answer, Python2 is messy/buggy when handling unicode strings, defeating the "Explicit is better than implicit" rule from the Zen of Python. And for more weird behaviour, the following works just fine in python2:

>>> value = "Bitte überprüfen"
>>> out = '"{}" = "{}";\n'.format('no_internet', value)
>>> out
'"no_internet" = "Bitte \xc3\xbcberpr\xc3\xbcfen";\n'
>>> print(out)
"no_internet" = "Bitte überprüfen";

Still not convinced to switch to python3 ? :-)

Update:

This is the way to read a unicode string from one file and write it to another:

 % echo "Bitte überprüfen" > /tmp/foobar
 % python2
>>> with open('/tmp/foobar', 'r') as f:
...     data = f.read().decode('utf-8').strip()
... 
>>> 
>>> with open('/tmp/foo2', 'w') as f:
...     f.write(('"{}" = "{}";\n'.format('no_internet', data.encode('utf-8'))))
... 
>>> import sys;sys.exit(0)
 % cat /tmp/foo2
"no_internet" = "Bitte überprüfen";

Update:

as a general rule:

when you get a UnicodeDecodeError, a byte string was implicitly decoded as ASCII: call .decode('utf-8') yourself on the byte string that contains the non-ASCII data, and
when you get a UnicodeEncodeError, a unicode string was implicitly encoded as ASCII: call .encode('utf-8') yourself on the unicode string
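
To see which direction produces which error, here is a minimal sketch, assuming Python 3, where the ASCII step has to be spelled out explicitly (Python 2 performs it behind your back). The helper name classify is hypothetical.

```python
def classify(action):
    """Run action() and report which Unicode error, if any, it raises."""
    try:
        action()
    except UnicodeDecodeError:
        return "decode error"
    except UnicodeEncodeError:
        return "encode error"
    return "ok"


text = u"Bitte \u00fcberpr\u00fcfen"   # Unicode text
data = text.encode("utf-8")            # the same text as UTF-8 bytes

results = {
    "bytes decoded as ascii": classify(lambda: data.decode("ascii")),
    "bytes decoded as utf-8": classify(lambda: data.decode("utf-8")),
    "text encoded as ascii": classify(lambda: text.encode("ascii")),
    "text encoded as utf-8": classify(lambda: text.encode("utf-8")),
}

for step, outcome in sorted(results.items()):
    print("%s: %s" % (step, outcome))
```

In both failing cases the fix is the same: name the real encoding (utf-8 here) instead of letting ascii be used.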

Update: if you cannot upgrade to python3, you can at least make python2 behave almost like python3 with the following __future__ import statement:

from __future__ import absolute_import, division, print_function, unicode_literals

HTH

Upvotes: 3

Karl Knechtel

Reputation: 61617

Why ascii if I say utf-8?

Because in Python 2, "Bitte überprüfen" is not a Unicode string but a byte string. Before it can be .encoded by your explicit call, Python must implicitly decode it to Unicode (this is also why it raises a UnicodeDecodeError), and it chooses ASCII because it has no other information to work with. The ü is stored as bytes with values >= 128 (0xc3 0xbc in UTF-8, hence the 0xc3 in your error), so it's not valid ASCII.
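
The position reported in the traceback can be reproduced. This is a minimal sketch, assuming Python 3 and assuming the .py file was saved as UTF-8 (which is consistent with the 0xc3 byte in the error). It spells out the ASCII decode that Python 2 performs implicitly.

```python
# Build the same line as in the question, as Unicode text.
line = u'"%s" = "%s";\n' % (u"no_internet", u"Bitte \u00fcberpr\u00fcfen")

# These are the bytes a Python 2 str would hold if the source is UTF-8.
raw = line.encode("utf-8")

try:
    # Python 2 does this step implicitly before it runs .encode("utf-8").
    raw.decode("ascii")
    failure = None
except UnicodeDecodeError as exc:
    # Record the codec, the failing offset, and the offending byte.
    failure = (exc.encoding, exc.start, raw[exc.start:exc.start + 1])

print(failure)
```

The failing offset is the first byte of the first ü, which is the same position 23 and byte 0xc3 that the question's error message reports.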

The u prefix shown by @JuniorCompressor will make it a Unicode string, and you should specify the encoding for the file as well (don't just blindly set utf-8; it needs to match whatever your text editor saves the .py file with!).

Switching to Python 3 is realistically (part of) a better long-term solution :) but it is still essential to understand the problem. See http://bit.ly/unipain for more details. The Python 2 behaviour is really a bug, or at least a failure to meet Pythonic design principles: Explicit is better than implicit, and here we see why very clearly ;)

Upvotes: 0

JuniorCompressor

Reputation: 20025

Try:

value = u"Bitte überprüfen"

in order to declare value as a unicode string and

# -*- coding: utf-8 -*-

at the start of your file in order to declare that your python file is saved with utf-8 encoding.

Upvotes: 6
