Reputation: 1512
In Writing utf-8 formatted Python lists to CSV, @abamert suggests some sample code from the csv documentation to handle this case.
I am unable to fix the problem with that code, and I wonder what I am doing wrong.
Here is my test code:
# -*- coding: UTF-8 -*-
import csv
import codecs
import csvutf8 # sample code from csv documentation.
x = u'owner’s'
with codecs.open('simpleout.txt', 'wb', 'UTF_8') as of:
    spamwriter = csvutf8.UnicodeWriter(of)
    spamwriter.writerow([x])
and csvutf8.py, the file into which I copied and pasted the code from the documentation, is at the end of this message.
The error message from codecs.py in the library is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)
What can I do to make this work?
csvutf8.py
"""Helper classes to output UTF_8 through CSV in Python 2.x"""
import csv, codecs, cStringIO
class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
Upvotes: 0
Views: 406
Reputation: 365657
The UnicodeWriter sample code is meant to be used with a plain bytes file like you get from open, not a Unicode file like you get from codecs.open (or io.open). The simplest fix is to just use open instead of codecs.open in your main script:
with open('simpleout.txt', 'wb') as of:
If you're going to be using csvutf8 in a project you'll be coming back to a year from now, or working on with other colleagues, you may want to consider adding a test like this in the __init__ methods, so the next time you make this mistake (which you will) it'll show up immediately, and with a more obvious error:
if isinstance(f, (
        codecs.StreamReader, codecs.StreamWriter,
        codecs.StreamReaderWriter, io.TextIOBase)):
    raise TypeError(
        'Need plain bytes files, not {}'.format(f.__class__))
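In Python 3 the same idea collapses to a single check, since every text-mode file derives from io.TextIOBase. A minimal sketch (the helper name require_binary_file is mine, not from the answer):

```python
import io

def require_binary_file(f):
    """Raise early if handed a text-mode file instead of a bytes file."""
    # In Python 3, open(..., 'w'), io.StringIO, etc. are all
    # subclasses of io.TextIOBase, so one isinstance catches them.
    if isinstance(f, io.TextIOBase):
        raise TypeError(
            'Need plain bytes files, not {}'.format(f.__class__))
    return f
```

With this, require_binary_file(io.BytesIO()) passes through unchanged, while require_binary_file(io.StringIO()) raises a TypeError that names the offending class.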
But if you're going to stick with Python 2,* these errors are hard to find until you get the hang of it, so you should learn how to spot them now. Here's some simpler code with the same error:
data1 = u'[owner’s]'
data2 = data1.encode('utf-8')
data3 = data2.encode('utf-8')
Test this in the interactive interpreter, and look at the repr, type, etc. of each intermediate step. You'll see that data2 is a str, not a unicode. That means it's just a bunch of bytes. What does it mean to encode a bunch of bytes to UTF-8? The only thing that makes sense** is to decode those bytes using your default encoding (which is ASCII because you haven't set anything else) into Unicode so that it can then be encoded back to bytes.
So, when you see one of those UnicodeDecodeErrors about ASCII (and you're pretty sure you were calling encode rather than decode), it's usually this problem. Check the type you're calling it on, and it's probably a str rather than a unicode.***
* I assume you have a good reason beyond your control for still using Python 2 in 2018. If not, the answer is a lot easier: just use Python 3 and this whole problem is impossible (and the code is simpler, and runs faster).
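To illustrate: in Python 3 the csv module works on text files directly, so the whole csvutf8 helper module disappears and the original script reduces to a few lines (a sketch, reusing the question's filename and data; newline='' is required so csv can control its own line endings):

```python
import csv

# Open in text mode with an explicit encoding; no recoder classes needed.
with open('simpleout.txt', 'w', newline='', encoding='utf-8') as of:
    spamwriter = csv.writer(of)
    spamwriter.writerow(['owner’s'])
```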
** If you think it would actually make a lot more sense for Python to just not try to guess what you meant, and make this an error… you're right, and that's one of the main reasons Python 3 exists.
*** Of course you still need to figure out why you have bytes where you expected Unicode. Sometimes it's really silly, like you did u = s.decode('latin1') but then you kept using s instead of u. Sometimes it's a little trickier, like this case, where you're using a library that's automatically encoding for you, but you didn't realize it. Sometimes it's even worse, like you've forgotten to decode some text off a website, and the script runs all day, silently creating mojibake for thousands of pages, before running into the first one with a Slavic name and finally raising an error.
Upvotes: 1