acpigeon

Reputation: 1729

Trouble with UTF-8 CSV input in Python

This seems like it should be an easy fix, but so far a solution has eluded me. I have a single-column CSV file containing non-ASCII characters, saved as UTF-8, that I want to read in and store in a list. I'm attempting to follow the principle of the "Unicode Sandwich" and decode upon reading the file in:

import codecs
import csv

with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
    input_file = csv.reader(file, delimiter=",", quotechar='|')
    list = []
    for row in input_file:
        list.extend(row)

This produces the dreaded 'codec can't encode characters in position, ordinal not in range(128)' error.

I've also tried adapting a solution from this answer, which raises a similar error:

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)

A very similar solution adapted from the docs returns the same error.

def unicode_csv_reader(utf8_data, dialect=csv.excel):
    csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)

Clearly I'm missing something. Most of the questions that I've seen regarding this problem seem to predate Python 2.7, so an update here might be useful.

Upvotes: 21

Views: 25994

Answers (3)

John Machin

Reputation: 83032

Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.

Your 2nd and 3rd snippets are confused. Something like the following is all that you need:

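import csv

# Open in binary mode: the Python 2 csv module expects byte strings.
with open('your_utf8_encoded_file.csv', 'rb') as f:
    reader = csv.reader(f)
    for utf8_row in reader:
        # Decode each cell from UTF-8 bytes to unicode after parsing.
        unicode_row = [x.decode('utf8') for x in utf8_row]
        print unicode_row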
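
To get the flat list the question asks for, the same idea could be wrapped in a function. This is only a minimal sketch: the name `read_utf8_column` is hypothetical, and the Python 3 branch is an assumption added for readers on newer interpreters, where the csv module accepts text rather than bytes.

```python
import csv
import io
import sys


def read_utf8_column(path):
    """Return every cell of a UTF-8 encoded CSV file as a flat list of text."""
    cells = []
    if sys.version_info[0] < 3:
        # Python 2: csv wants byte strings, so read bytes and decode each cell.
        with open(path, 'rb') as f:
            for utf8_row in csv.reader(f):
                cells.extend(cell.decode('utf-8') for cell in utf8_row)
    else:
        # Python 3 (assumption, not part of the original answer): csv works
        # on text, so the file is decoded while it is being read.
        with io.open(path, 'r', encoding='utf-8', newline='') as f:
            for row in csv.reader(f):
                cells.extend(row)
    return cells
```

Either way the decoding happens at the input boundary, which keeps to the "Unicode Sandwich" idea: the rest of the program only ever sees text.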

Upvotes: 18

Zeugma

Reputation: 32125

As it fails on the first character read, you may have a BOM (byte order mark). Use codecs.open('utf8file.csv', 'rU', encoding='utf-8-sig') if your file is UTF-8 and has a BOM at the beginning.
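
The difference between the two codecs can be checked on a few bytes without touching a file. A small sketch follows; the sample text is made up for illustration.

```python
import codecs

# UTF-8 bytes for a single cell, preceded by the byte order mark (BOM).
raw = codecs.BOM_UTF8 + u'caf\u00e9'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if there is one and otherwise
# behaves like 'utf-8'.
without_bom = raw.decode('utf-8-sig')
```

Since 'utf-8-sig' also decodes input that has no BOM, it is a reasonable choice when you are unsure how the file was saved.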

Upvotes: 12

Clarus

Reputation: 2338

I'd suggest trying just:

input_file = csv.reader(open('utf8file.csv', 'r'), delimiter=",", quotechar='|')

or

input_file = csv.reader(open('utf8file.csv', 'rb'), delimiter=",", quotechar='|')

csv should be unicode aware, and it should just work.

Upvotes: -2
