Baz

Reputation: 13133

UnicodeDecodeError: 'utf8' codec can't decode byte "0xc3"

In Python 2.7 I have this:

# -*- coding: utf-8 -*-
from nltk.corpus import abc
with open("abc.txt", "w") as f:
    f.write(" ".join(abc.words()))

I then try to read in this document in Python 3:

 with open("abc.txt", 'r', encoding='utf-8') as f:
     f.read()

only to get:

  File "C:\Python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 633096: invalid continuation byte

What have I done wrong? Notepad++ seems to indicate that the document is UTF-8. Even if I convert the document to UTF-8 with Notepad++, I still get this error in Python 3, which is strange, since I can read many other UTF-8 encoded documents without any problems.
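
For reference, a minimal diagnostic sketch for looking at the raw bytes around the failing offset (the file name and offset are taken from the code and traceback above):

with open("abc.txt", "rb") as f:  # binary mode, so nothing gets decoded
    data = f.read()
# show the bytes surrounding the reported position 633096
print(repr(data[633090:633103]))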

Upvotes: 11

Views: 52150

Answers (3)

Donn Cave

Reputation: 1

I happened to encounter this question while trying to find out what the point of this \303 (0xC3) byte might be. Whatever the idea, it isn't a very good one: it's just an invalid byte that some text editors throw in, perhaps meant as something like a non-breaking space in their little world.

So here is Python 3, which takes it upon itself to run all of its text input through a decoder and convert it to Unicode, with the unfortunate result you found when an invalid byte turns up. Because what would it benefit you to go forward with an invalid byte in your text data?
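
For illustration, a minimal sketch of what Python 3 does when a 0xC3 lead byte is not followed by a valid continuation byte (the same error as in the traceback above):

# 0xC3 starts a two-byte UTF-8 sequence; a following space is not a valid continuation byte
b"\xc3 ".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: invalid continuation byte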

Upvotes: -2

Weeble

Reputation: 17910

Based on the fact that your Python 2.7 snippet doesn't throw an exception, I would infer that abc.words() returns a sequence of bytestrings. These are unlikely to be encoded in UTF-8; I'd guess maybe Latin-1 or something like that. You then write them to the file. No encoding happens at this point.

You probably need to convert these to unicode strings, for which you'll need to know their existing encoding, and then you'll need to encode these as UTF-8 when writing the file.

For example:

# -*- coding: utf-8 -*-
from nltk.corpus import abc
import codecs

# decode each bytestring (assuming Latin-1) to unicode, and let
# codecs.open() encode the result as UTF-8 when writing
with codecs.open("abc.txt", "w", "utf-8") as f:
    f.write(u" ".join(codecs.decode(word, "latin-1") for word in abc.words()))

Some further notes, in case there's any confusion:

  • The -*- coding: utf-8 -*- line refers to the encoding used to write the Python script itself. It has no effect on the input or output of that script.
  • In Python 2.7, there are two kinds of strings: bytestrings, which are sequences of bytes with an unspecified encoding, and unicode strings, which are sequences of unicode code points. Bytestrings are most common and are what you get if you use the regular "abc" string literal syntax. Unicode strings are what you get when you use the u"abc" syntax.
  • In Python 2.7, if you just use the open function to open a file and write bytestrings to it, no encoding will happen. The bytes of the bytestring are written straight into the file. If you try to write unicode strings to it, you'll get an exception if they contain characters that can't be encoded by the default (ASCII) codec; the sketch after this list shows both cases.
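
A minimal Python 2.7 sketch of those two cases (the file name is just for illustration):

# Python 2.7
s = "caf\xc3\xa9"  # bytestring: raw bytes, encoding unspecified
u = u"caf\xe9"     # unicode string: a sequence of code points
with open("demo.txt", "w") as f:
    f.write(s)  # the bytes are written verbatim, no encoding step
    f.write(u)  # raises UnicodeEncodeError: the default (ASCII) codec can't encode u'\xe9'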

Upvotes: 2

user1907906

Reputation:

My guess is that your input is encoded as ISO-8859-2, which maps 0xC3 to Ă. Check the encoding of your input file.
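
For example, a minimal sketch using the chardet package to guess the encoding (chardet is an assumption here; a manual trial decode with a few candidate encodings works just as well):

import chardet

with open("abc.txt", "rb") as f:  # read raw bytes, no decoding
    raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.7, ...}
print(guess["encoding"], guess["confidence"])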

Upvotes: 4
