avorum
avorum

Reputation: 2293

trouble decoding unicode properly in python

I have a python script that is using xlwt/xlrd to work with an excel file. at the beginning of my script I have the following code:

#if you got a csv in parameters, convert it to an xls file
if '.csv' in sys.argv[1]:
    #name of new file after conversion is finished
    name = sys.argv[1]
    csvfile = open(sys.argv[1], 'rb')
    try:
        #extract data from .csv
        csvReader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        csvData = list(csv.reader(open(name, 'rb')))
        # write to a xls file
        outFile = xlwt.Wrokbook()
        newSheet = outFile.add_sheet('Sheet 1')
        # traverse over 2d array to write each individual cell
        for row in range(len(csvData)):
            for col in range(len(csvData[0])):
                newSheet.write(row, col, csvData[row][col].encode('utf8'))
         name = name[:-4] + ".xls" #change extension of file
         outFile.save(name)
         wb = open_workbook(name)
    finally:
         csvfile.close()

Which gives the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 44: ordinal not in range(128)

on the line outFile.save(name)

The only useful thing I've found so far to help with this is UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1, but my terminal is using utf8 as its encoding.

EDIT: Totally forgot to mention this, so sorry.

I believe that the line with .encode is causing the error somehow but I can't think how. I originally had it without the .encode, then i added .encode('utf8'), also tried .encode('utf-8') and unicode(string, 'utf8'). I'm not sure what else to try to solve this problem.

EDIT: I tried Brian's suggestion to no avail. Additionally I attempted the codecs.open suggestion and also tried specifying the encoding when I create the workbook. None of these things change the error at all. The only thing I've tried that changes the error is the adding of the .encode on the line with newSheet.write. Without it I get:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 44: ordinal no in range(128)

And with it I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 44: ordinal not in range(128)

Upvotes: 0

Views: 2407

Answers (3)

Peter DeGlopper
Peter DeGlopper

Reputation: 37319

The problem is that when you have non-ascii characters in your input, you're not passing them to xlwt in the state it expects.

According to xlwt's docs:

A unicode instance is written as is. A str instance is converted to unicode using the encoding (default: 'ascii') specified when the Workbook instance was created.

https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/doc/xlwt.html?p=4966#xlwt.Worksheet.write-method

That is, when your input csv file contains a non-ascii character encoded using utf-8, the reader pulls it in as an encoded Python string - if you viewed it directly you'd see the multiple hex bytes, eg '\xc3\xa1' for a lower case a-acute. When you write that to the worksheet, it has to decode it. When creating your workbook, you didn't specify an encoding, so it tries to do this using the default ascii encoding. Which doesn't work, as you see, since those are not ascii bytes.

Your options are to either pass in Unicode strings to the worksheet, decoding them from the csv reader's results (or wrapping the csv reader in something that decodes everything - it's the same thing) or to set the encoding on the workbook when creating it.

Upvotes: 0

Brian
Brian

Reputation: 3131

According to the docs:

The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.

Try the following snippet which gives you a generator that reads a csv with unicode data. Note that this code is taken directly from the docs linked above:

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

As an example of how to use the above code, instead of

csvReader = csv.reader(csvfile, delimiter=' ', quotechar='|')

use

csvReader = unicode_csv_reader(csvfile, delimiter=' ', quotechar='|')

yield is the return equivalent for a generator function. The function returns a generator object, which is a type of iterable in python. **kwargs represents keyword arguments, which is what you are actually passing when you write delimiter=' ', quotechar='|'

Upvotes: 1

Chris Doggett
Chris Doggett

Reputation: 20757

Try using the built-in codecs library to open the file:

#!/usr/bin/env python2.7
# -*- coding: UTF-8 -*-
import codecs

with codecs.open(sys.argv[1], "rb", encoding="utf-8") as csvfile:
    csvReader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    # snipped the rest of the code

Upvotes: 0

Related Questions