Alice

Reputation: 13

Python: Read German/Spanish CSV files with UTF-8 encoding

I have a question about opening and reading a UTF-8 encoded CSV file using Python. I spent most of the day browsing Stack Overflow topics and the Python csv module docs, but I can't seem to find the right solution. My CSV file contains Spanish and German words with 'special' characters (ñ, é, etc.). Here is a snippet of my file:

english_person,spanish_M,spanish_F,german_person
woman,mujer ,mujer ,Frau 
strong,fuerte ,fuerte ,stark 
boy,niño ,niño ,Junge 

Simply trying to read it with the codecs module doesn't work:

import csv
import codecs

f = codecs.open('file.csv', 'rb', encoding='utf-8')
reader = csv.reader(f)
for line in reader:
    print line

I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

So I downloaded the unicodecsv module and tried to read the file like this:

import unicodecsv

myfile = open('file.csv')
data = unicodecsv.reader(myfile, encoding='utf-8', delimiter=';')
for row in data:
    print row

I luckily don't get an error anymore, but I still get these strange characters in my output (in the last line):

[u'\ufeffenglish_person,spanish_M,spanish_F,german_person']
[u'woman,mujer ,mujer ,Frau ']
[u'strong,fuerte ,fuerte ,stark ']
[u'boy,ni\xf1o ,ni\xf1o ,Junge ']

What is going on and how can I solve this? Thank you for your help!

Upvotes: 1

Views: 5140

Answers (1)

bobince

Reputation: 536567

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0

That's not a problem reading the CSV. That's a problem printing it to the console. Your console doesn't support Unicode, so it can't print the U+FEFF Byte Order Mark character from the front of the CSV file. (It's common to put a faux-BOM in UTF-8 CSV files as Excel won't read them otherwise.)
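If you would rather not carry the BOM around at all, one option is to decode with the utf-8-sig codec, which drops a leading BOM if one is present. A minimal sketch, assuming Python 3 (where the csv module works on Unicode text directly); the raw bytes and the read_rows helper are made up for illustration and stand in for your file.csv:

```python
import csv
import io

# Hypothetical sample standing in for file.csv:
# a UTF-8 BOM (EF BB BF) followed by a header row and one data row.
raw = b'\xef\xbb\xbfenglish_person,spanish_M\r\nboy,ni\xc3\xb1o \r\n'

def read_rows(data, encoding):
    """Decode CSV bytes with the given encoding and return the rows as lists."""
    text = io.StringIO(data.decode(encoding), newline='')
    return list(csv.reader(text))

# Plain 'utf-8' keeps U+FEFF glued to the first header field.
with_bom = read_rows(raw, 'utf-8')

# 'utf-8-sig' strips the leading BOM while decoding.
without_bom = read_rows(raw, 'utf-8-sig')
```

With a real file the same idea would presumably be open('file.csv', encoding='utf-8-sig', newline='') passed straight to csv.reader.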

The Windows console is essentially broken for Unicode output from applications that use the MS C runtime stdlib. See PrintFails on the Python wiki.

I luckily don't get an error anymore, but I still get these strange characters in my output (in the last line):

You are printing rows here, not individual values. Each row is a list of strings. When you print a list it comes out in repr form, so your strings are printing in Python string literal form. u'ni\xf1o' and u'niño' are the same string.
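To see the difference between the list's repr and the values themselves, here is a small sketch. It assumes Python 3, where ascii() produces the same escaped form that Python 2's repr shows for non-ASCII characters; the row is copied from your sample data.

```python
# One row as the csv reader would return it (values taken from the question).
row = ['boy', 'ni\xf1o ', 'Junge ']

# Escaped, literal-style form of the whole list, like the output in the question.
escaped = ascii(row)

# Joining (or printing each value) gives the actual characters.
joined = ', '.join(row)

# The escape is only notation: both spellings are the same string.
same = 'ni\xf1o' == 'niño'
```

So looping over the row and printing each value, rather than printing the list, should give you the ñ you expect, provided your console can display it.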

(This is slightly clearer if you use the correct delimiter, a comma (,), rather than a semicolon (;). With the wrong delimiter each row comes back as a single string.)
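A quick sketch of what the delimiter changes, again assuming Python 3's built-in csv module and one line taken from your sample file:

```python
import csv

# csv.reader accepts any iterable of lines, so a one-item list is enough here.
lines = ['boy,niño ,niño ,Junge ']

# With ';' nothing is split: the whole line is one field.
wrong = next(csv.reader(lines, delimiter=';'))

# With ',' the line is split into its four columns.
right = next(csv.reader(lines, delimiter=','))
```

The trailing spaces inside the fields come from the file itself; call strip() on each value if you don't want them.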

Upvotes: 1
