Reputation: 2293
I have a python script that is using xlwt/xlrd to work with an excel file. at the beginning of my script I have the following code:
#if you got a csv in parameters, convert it to an xls file
if '.csv' in sys.argv[1]:
#name of new file after conversion is finished
name = sys.argv[1]
csvfile = open(sys.argv[1], 'rb')
try:
#extract data from .csv
csvReader = csv.reader(csvfile, delimiter=' ', quotechar='|')
csvData = list(csv.reader(open(name, 'rb')))
# write to a xls file
outFile = xlwt.Wrokbook()
newSheet = outFile.add_sheet('Sheet 1')
# traverse over 2d array to write each individual cell
for row in range(len(csvData)):
for col in range(len(csvData[0])):
newSheet.write(row, col, csvData[row][col].encode('utf8'))
name = name[:-4] + ".xls" #change extension of file
outFile.save(name)
wb = open_workbook(name)
finally:
csvfile.close()
Which gives the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 44: ordinal not in range(128)
on the line outFile.save(name)
The only useful thing I've found so far to help with this is UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1, but my terminal is using utf8 as its encoding.
EDIT: Totally forgot to mention this, so sorry.
I believe that the line with .encode is causing the error somehow but I can't think how. I originally had it without the .encode, then i added .encode('utf8'), also tried .encode('utf-8') and unicode(string, 'utf8'). I'm not sure what else to try to solve this problem.
EDIT: I tried Brian's suggestion to no avail. Additionally I attempted the codecs.open suggestion and also tried specifying the encoding when I create the workbook. None of these things change the error at all. The only thing I've tried that changes the error is the adding of the .encode on the line with newSheet.write. Without it I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 44: ordinal no in range(128)
And with it I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 44: ordinal not in range(128)
Upvotes: 0
Views: 2407
Reputation: 37319
The problem is that when you have non-ascii characters in your input, you're not passing them to xlwt in the state it expects.
According to xlwt's docs:
A unicode instance is written as is. A str instance is converted to unicode using the encoding (default: 'ascii') specified when the Workbook instance was created.
https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/doc/xlwt.html?p=4966#xlwt.Worksheet.write-method
That is, when your input csv file contains a non-ascii character encoded using utf-8, the reader pulls it in as an encoded Python string - if you viewed it directly you'd see the multiple hex bytes, eg '\xc3\xa1'
for a lower case a-acute. When you write that to the worksheet, it has to decode it. When creating your workbook, you didn't specify an encoding, so it tries to do this using the default ascii
encoding. Which doesn't work, as you see, since those are not ascii bytes.
Your options are to either pass in Unicode strings to the worksheet, decoding them from the csv reader's results (or wrapping the csv reader in something that decodes everything - it's the same thing) or to set the encoding on the workbook when creating it.
Upvotes: 0
Reputation: 3131
According to the docs:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
Try the following snippet which gives you a generator that reads a csv with unicode data. Note that this code is taken directly from the docs linked above:
import csv
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
# csv.py doesn't do Unicode; encode temporarily as UTF-8:
csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
dialect=dialect, **kwargs)
for row in csv_reader:
# decode UTF-8 back to Unicode, cell by cell:
yield [unicode(cell, 'utf-8') for cell in row]
def utf_8_encoder(unicode_csv_data):
for line in unicode_csv_data:
yield line.encode('utf-8')
As an example of how to use the above code, instead of
csvReader = csv.reader(csvfile, delimiter=' ', quotechar='|')
use
csvReader = unicode_csv_reader(csvfile, delimiter=' ', quotechar='|')
yield
is the return equivalent for a generator function. The function returns a generator object, which is a type of iterable in python. **kwargs
represents keyword arguments, which is what you are actually passing when you write delimiter=' ', quotechar='|'
Upvotes: 1
Reputation: 20757
Try using the built-in codecs library to open the file:
#!/usr/bin/env python2.7
# -*- coding: UTF-8 -*-
import codecs
with codecs.open(sys.argv[1], "rb", encoding="utf-8") as csvfile:
csvReader = csv.reader(csvfile, delimiter=' ', quotechar='|')
# snipped the rest of the code
Upvotes: 0