Reputation: 13
I have a question about opening and reading a UTF-8 encoded CSV file using Python. I spent most of the day browsing Stack Overflow topics and the Python csv module, but I can't seem to find the right solution. My CSV file contains Spanish and German words with 'special' characters (ñ, é, etc.). Here is a snippet of my file:
english_person,spanish_M,spanish_F,german_person
woman,mujer ,mujer ,Frau
strong,fuerte ,fuerte ,stark
boy,niño ,niño ,Junge
Simply trying to read it with the codecs module doesn't work:
import csv
import codecs
f = codecs.open('file.csv', 'rb', encoding='utf-8')
reader = csv.reader(f)
for line in reader:
    print line
I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
So, I downloaded the unicodecsv module and attempted to read the file like this:
import unicodecsv
myfile = open('file.csv')
data = unicodecsv.reader(myfile, encoding='utf-8', delimiter=';')
for row in data:
    print row
I luckily don't get an error anymore, but I still get these strange characters in my output (in the last line):
[u'\ufeffenglish_person,spanish_M,spanish_F,german_person']
[u'woman,mujer ,mujer ,Frau ']
[u'strong,fuerte ,fuerte ,stark ']
[u'boy,ni\xf1o ,ni\xf1o ,Junge ']
What is going on and how can I solve this? Thank you for your help!
Upvotes: 1
Views: 5140
Reputation: 536567
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0
That's not a problem reading the CSV. That's a problem printing it to the console. Your console doesn't support Unicode, so it can't print the U+FEFF Byte Order Mark character from the front of the CSV file. (It's common to put a faux-BOM in UTF-8 CSV files as Excel won't read them otherwise.)
The Windows console is essentially broken for Unicode from applications using the MS C runtime stdlib; see the PrintFails page on the Python wiki.
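To make the BOM point concrete, here is a minimal sketch of how the leading U+FEFF ends up in the decoded text. The bytes below are a made-up stand-in for the start of file.csv, not the real file, and the "utf-8-sig" codec is my suggestion rather than something the question used. It should behave the same on Python 2.7 and Python 3.

```python
import codecs

# Hypothetical bytes standing in for the start of file.csv:
# a UTF-8 BOM followed by UTF-8 encoded text.
raw = codecs.BOM_UTF8 + u"boy,ni\xf1o ,ni\xf1o ,Junge\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")
```

If the file is opened with encoding='utf-8-sig' instead of 'utf-8', the first header cell should no longer start with u'\ufeff'. Printing non-ASCII text to a console that cannot encode it would still fail, since that is a separate issue.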
I luckily don't get an error anymore, but I still get these strange characters in my output (in the last line):
You are printing rows here, not individual values. Each row is a list of strings. When you print a list it comes out in repr form, so your strings are printing in Python string literal form. u'ni\xf1o' and u'niño' are the same string.
(This is slightly clearer if you use the correct delimiter "," rather than ";".)
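As a rough illustration of both points (the delimiter and the escaped form), here is a sketch for Python 3, where the csv module reads text directly and unicodecsv is not needed. The temporary file, the read_rows helper, and the sample row are all hypothetical stand-ins for your file.csv. Python 3 no longer escapes ñ when printing a list, so ascii() is used here to reproduce the escaped form you saw.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for file.csv, written as UTF-8 with a leading BOM.
sample = (
    "english_person,spanish_M,spanish_F,german_person\n"
    "boy,ni\xf1o ,ni\xf1o ,Junge\n"
)
path = os.path.join(tempfile.mkdtemp(), "file.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write(sample)


def read_rows(path, delimiter):
    """Read all rows; 'utf-8-sig' drops the BOM while decoding."""
    with open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))


# With ";" nothing is split: each row is a list holding one long string.
wrong = read_rows(path, ";")

# With "," each row is split into four cells.
right = read_rows(path, ",")

# One individual value, with the trailing space from the file removed.
word = right[1][1].strip()

# ascii() gives the escaped literal form of that same string.
escaped = ascii(word)
```

Here wrong[1] holds the whole line as a single cell, right[1] holds four cells, and escaped is just another spelling of word, not a different string.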
Upvotes: 1