Reputation: 13133
In Python 2.7 I have this:
# -*- coding: utf-8 -*-
from nltk.corpus import abc
with open("abc.txt","w") as f:
    f.write(" ".join(i.words()))
I then try to read in this document in Python 3:
with open("abc.txt", 'r', encoding='utf-8') as f:
    f.read()
only to get:
File "C:\Python32\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 633096: invalid continuation byte
What have I done wrong? Notepad++ seems to indicate that the document is UTF-8. Even if I convert the document to UTF-8 with Notepad++, I still get this error in Python 3, which is strange, since I read many other UTF-8 encoded documents without any problems.
Upvotes: 11
Views: 52150
Reputation: 1
Happened to encounter this question while trying to find out what the point of this \303 or 0xC3 byte might be. In UTF-8 it is not a character on its own: it is the lead byte of a two-byte sequence, so it is only valid when a continuation byte follows it. A lone 0xC3 usually means the text was written in a single-byte encoding such as Latin-1, where 0xC3 is Ã.
Python 3 decodes everything it reads in text mode to Unicode, with the unfortunate result you found when the input contains a byte sequence that is invalid for the declared encoding. There is little benefit in carrying on silently with undecodable bytes in your text data.
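A quick way to see how the UTF-8 codec treats 0xC3 is to decode a few byte strings by hand. This is only a sketch with made-up sample bytes, not the contents of the asker's file:

```python
# 0xC3 followed by a continuation byte (0x80-0xBF) is a valid two-byte UTF-8 sequence.
valid = b"\xc3\xa9".decode("utf-8")  # decodes to U+00E9

# 0xC3 followed by a space (0x20) is not valid, so strict decoding raises.
try:
    b"\xc3 ".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte instead of raising.
lenient = b"\xc3 ".decode("utf-8", errors="replace")
```

The built-in open() in Python 3 accepts the same errors argument, but replacing bytes hides the underlying encoding mismatch rather than fixing it.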
Upvotes: -2
Reputation: 17910
Based on the fact that your piece of Python 2.7 doesn't throw an exception, I would infer that i.words() returns a sequence of bytestrings. These are unlikely to be encoded in UTF-8 - I'd guess maybe Latin-1 or something like that. You then write them to the file. No encoding happens at this point.
You probably need to convert these to unicode strings, for which you'll need to know their existing encoding, and then you'll need to encode these as UTF-8 when writing the file.
For example:
# -*- coding: utf-8 -*-
from nltk.corpus import abc
import codecs
with codecs.open("abc.txt","w","utf-8") as f:
    f.write(u" ".join(codecs.decode(word,"latin-1") for word in i.words()))
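For comparison, the same decode-then-encode idea can be sketched in Python 3 without NLTK. The sample words and the Latin-1 source encoding below are assumptions for illustration only; the real encoding of the corpus still has to be checked.

```python
# Hypothetical bytestrings standing in for corpus words; Latin-1 is an assumed encoding.
raw_words = [b"caf\xe9", b"na\xefve"]
raw = b" ".join(raw_words)

# Written to a file as-is, these bytes would not be valid UTF-8.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Decode with the (assumed) source encoding, then encode as UTF-8.
text = raw.decode("latin-1")
utf8_bytes = text.encode("utf-8")

# The re-encoded bytes now round-trip through the UTF-8 codec.
round_trip = utf8_bytes.decode("utf-8")
```

Note that each accented character becomes a two-byte sequence starting with 0xC3 after encoding, which is the valid use of that byte in UTF-8.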
Some further notes, in case there's any confusion:
The -*- coding: utf-8 -*- line refers to the encoding used to write the Python script itself. It has no effect on the input or output of that script.
In Python 2, bytestrings are what you get when you use the "abc" string literal syntax. Unicode strings are what you get when you use the u"abc" syntax.
Upvotes: 2
Reputation:
My guess is that your input is encoded as ISO-8859-2, which contains Ă as 0xC3. Check the encoding of your input file.
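One way to check is to look at the raw bytes around the position the error reports. The sketch below uses a made-up byte string, and first_bad_offset is a hypothetical helper name, not a library function:

```python
# The same byte maps to different characters in different single-byte encodings.
sample = b"\xc3"
as_latin1 = sample.decode("latin-1")     # U+00C3
as_latin2 = sample.decode("iso-8859-2")  # U+0102

def first_bad_offset(data, encoding="utf-8"):
    """Return the offset of the first undecodable byte, or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None

# Made-up data with a stray 0xC3; real data would come from open("abc.txt", "rb").read().
data = b"abc \xc3 def"
offset = first_bad_offset(data)
# A few bytes either side of the offending position, for inspection.
context = data[max(0, offset - 4):offset + 4]
```

Seeing which word surrounds the bad byte usually makes it clear which single-byte encoding produces a sensible character there.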
Upvotes: 4