Reputation: 21118
I tried to use the csv module to parse a CSV file, but it does not handle UTF-8 encoded data.
So I tried the functions suggested in the documentation:
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
But when I try to use them like this:
with open(u'spam1.csv', 'rb') as csvfile:
    spamreader = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print row
I get this error:
yield line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)
Yet when I open the same CSV file in LibreOffice with UTF-8 encoding, it loads fine.
Upvotes: 2
Views: 1274
Reputation: 1121514
The code is meant to be used on unicode values; that is, you need to decode your data to unicode
before passing it in to the replacement reader.
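To illustrate where the error comes from: in Python 2, calling .encode('utf-8') on a byte string makes the interpreter first decode it implicitly with the ASCII codec, and that hidden step is what fails. Here is a minimal sketch that performs that decode explicitly, so it behaves the same on Python 2 and 3. The sample line and the helper name ascii_decode_error are made up for illustration; they are not taken from your file.

```python
# Hypothetical sample line, as bytes read from a file opened in binary mode.
# The first non-ASCII byte (0xc2, the start of a UTF-8 encoded pound sign)
# sits at index 15.
raw_line = b'spam,eggs,price\xc2\xa35\n'


def ascii_decode_error(data):
    """Return the UnicodeDecodeError raised by an ASCII decode, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return exc
    return None


# This is the decode that Python 2 attempts implicitly before encoding.
error = ascii_decode_error(raw_line)

# Decoding explicitly with the right codec works, and the resulting text
# can then be safely encoded back to UTF-8 bytes.
text = raw_line.decode('utf-8')
roundtrip = text.encode('utf-8')
```

Here error describes a failure of the 'ascii' codec at the first byte outside the ASCII range, which is the same kind of failure as in your traceback.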
Use io.open() to read the data as Unicode:
import io
with io.open(u'spam1.csv', 'r', encoding='utf8') as csvfile:
    spamreader = unicode_csv_reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print row
The replacement reader then temporarily encodes that Unicode text to UTF-8 so the csv module can handle it.
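For comparison, and only if moving to Python 3 is an option (the code in this thread is Python 2): there the csv module works on text directly, so no encoding wrapper is needed. The following is a rough sketch under that assumption; read_utf8_csv is a hypothetical helper name, and the temporary file merely stands in for spam1.csv.

```python
import csv
import io
import os
import tempfile


def read_utf8_csv(path, **kwargs):
    """Return all rows of a UTF-8 encoded CSV file as lists of text cells."""
    # newline='' leaves line-ending handling to the csv module.
    with io.open(path, 'r', encoding='utf-8', newline='') as csvfile:
        return list(csv.reader(csvfile, **kwargs))


# Hypothetical sample data standing in for spam1.csv.
fd, sample_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-8', newline='') as f:
    f.write(u'name,price\ncaf\xe9,\xa35\n')

rows = read_utf8_csv(sample_path, delimiter=',', quotechar='"')
os.remove(sample_path)
```

Each cell in rows is already a text string, so no per-cell decoding step is required in this variant.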
Because your data is already encoded as UTF-8, you could get away with:
with open(u'spam1.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        row = [unicode(cell, 'utf-8') for cell in row]
as well. This decodes your row cells from UTF-8 directly, instead of decoding to Unicode first, then encoding again to UTF-8 bytes, then decoding once more.
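Parsing the raw bytes and decoding afterwards is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII delimiter such as a comma or a quote can never appear inside an encoded character. As a small sketch of that equivalence, the example below uses a plain split on commas as a simplified stand-in for the csv module (it ignores quoting), and the sample data is made up for illustration.

```python
# Hypothetical sample data containing non-ASCII characters.
sample_text = u'name,city\nZo\xeb,K\xf6ln\n'
sample_bytes = sample_text.encode('utf-8')

# Route 1: decode first, then split the text into cells.
rows_from_text = [line.split(u',') for line in sample_text.splitlines()]

# Route 2: split the UTF-8 bytes into cells, then decode each cell.
rows_from_bytes = [
    [cell.decode('utf-8') for cell in line.split(b',')]
    for line in sample_bytes.splitlines()
]
```

Both routes yield the same rows of text cells for this data, which is why the shorter version does not need the extra encode and decode round trip.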
Upvotes: 3