WesR

Reputation: 1512

Write utf-8 through python csv? (prev answer not working)

In Writing utf-8 formated Python lists to CSV, @abarnert suggests some sample code from the csv documentation to handle this case.

I am unable to fix the problem with that code, and I wonder what I am doing wrong.

Here is my test code:

# -*- coding: UTF-8 -*-
import csv
import codecs
import csvutf8  # sample code from csv documentation.
x = u'owner’s'
with codecs.open('simpleout.txt', 'wb', 'UTF_8') as of:
    spamwriter = csvutf8.UnicodeWriter(of)
    spamwriter.writerow([x])

and csvutf8.py, the file into which I copied and pasted the code from the documentation, is at the end of this message.

The error message from codecs.py in the library is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)

What can I do to make this work?

csvutf8.py

"""Helper classes to output UTF_8 through CSV in Python 2.x"""

import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

Upvotes: 0

Views: 406

Answers (1)

abarnert

Reputation: 365657

The UnicodeWriter sample code is meant to be used with a plain bytes file like you get from open, not a Unicode file like you get from codecs.open (or io.open). The simplest fix is to just use open instead of codecs.open in your main script:

with open('simpleout.txt', 'wb') as of:

If you're going to be using csvutf8 in a project you'll be coming back to a year from now, or working on with other colleagues, you may want to consider adding a test like this in the __init__ methods, so the next time you make this mistake (which you will) it'll show up immediately, and with a more obvious error:

if isinstance(f, (
        codecs.StreamReader, codecs.StreamWriter,
        codecs.StreamReaderWriter, io.TextIOBase)):
    raise TypeError(
        'Need plain bytes files, not {}'.format(f.__class__))
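As a quick sanity check (using a hypothetical helper name, and `codecs.open` standing in for the file object from the question), the guard fires on exactly the kind of wrapped stream the question passed in:

```python
import codecs
import io

def check_plain_bytes(f):
    """Hypothetical wrapper around the guard above."""
    if isinstance(f, (codecs.StreamReader, codecs.StreamWriter,
                      codecs.StreamReaderWriter, io.TextIOBase)):
        raise TypeError(
            'Need plain bytes files, not {}'.format(f.__class__))

# codecs.open returns a StreamReaderWriter, so the guard rejects it
# immediately, with a message that names the offending class:
with codecs.open('guard_demo.txt', 'wb', 'utf-8') as of:
    try:
        check_plain_bytes(of)
    except TypeError as e:
        print(e)
```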

But if you're going to stick with Python 2,* these errors are hard to find until you get the hang of it, so you should learn how to spot them now. Here's some simpler code with the same error:

data1 = u'[owner’s]'
data2 = data1.encode('utf-8')
data3 = data2.encode('utf-8')

Test this in the interactive interpreter, and look at the repr, type, etc. of each intermediate step. You'll see that data2 is a str, not a unicode. That means it's just a bunch of bytes. What does it mean to encode a bunch of bytes to UTF-8? The only thing that makes sense** is to decode those bytes, using your default encoding (which is ASCII because you haven't set anything else), into Unicode so that they can then be encoded back to bytes.
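You can make that hidden step explicit yourself and reproduce the exact error from the question's traceback (same byte, same position). A minimal sketch:

```python
# The implicit ASCII decode that Python 2 performs behind data2.encode(...),
# written out explicitly; it reproduces the question's traceback.
data1 = u'owner\u2019s'            # \u2019 is the curly apostrophe
data2 = data1.encode('utf-8')      # bytes: the apostrophe becomes \xe2\x80\x99
try:
    data2.decode('ascii')          # the hidden step behind the double-encode
except UnicodeDecodeError as e:
    print(e)
# 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)
```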

So, when you see one of those UnicodeDecodeErrors about ASCII (and you're pretty sure you were calling encode rather than decode), it's usually this problem. Check the type you're calling it on, and it's probably a str rather than a unicode.***


* I assume you have a good reason beyond your control for still using Python 2 in 2018. If not, the answer is a lot easier: just use Python 3 and this whole problem is impossible (and the code is simpler, and runs faster).
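For what that looks like: a minimal Python 3 sketch of the original script, with no helper classes at all (filename as in the question; `newline=''` is what the csv docs recommend when passing a file to `csv.writer`):

```python
import csv

# In Python 3, csv.writer accepts text directly and open() handles
# the encoding, so no recoding queue or StringIO buffer is needed.
x = 'owner\u2019s'
with open('simpleout.txt', 'w', encoding='utf-8', newline='') as of:
    spamwriter = csv.writer(of)
    spamwriter.writerow([x])
```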

** If you think it would actually make a lot more sense for Python to just not try to guess what you meant, and make this an error… you're right, and that's one of the main reasons Python 3 exists.
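A sketch of what "make this an error" means in practice: in Python 3 the bytes type simply has no encode method, so the double-encode from the earlier snippet fails loudly and immediately instead of guessing at a decode:

```python
# Python 3 equivalent of the data2.encode('utf-8') mistake above.
data = 'owner\u2019s'.encode('utf-8')   # bytes
try:
    data.encode('utf-8')                # Python 2 would implicitly decode first
except AttributeError as e:
    print(e)                            # bytes objects have no .encode at all
```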

*** Of course you still need to figure out why you have bytes where you expected Unicode. Sometimes it's really silly, like you did u = s.decode('latin1') but then kept using s instead of u. Sometimes it's a little trickier, like this case, where you're using a library that's automatically encoding for you, but you didn't realize it. Sometimes it's even worse, like when you've forgotten to decode some text off a website, and the script runs all day, silently creating mojibake for thousands of pages, before hitting the first one with a Slavic name and finally raising an error.

Upvotes: 1
