Reputation: 285
I have the following script to read a UTF-8 CSV:
def readCSV(f, bdgs):
    with open(f) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=';')
        for row in reader:
            for key, val in row.iteritems():
                print type(key), key,':',type(val),val
            print type(row), row
            if row['OBJECTID'] is not '':
                # do some magic
which yields this:
processing the following files: ['Fenetre.csv']
<type 'str'> Type de fenêtre_uniqueid : <type 'str'> uid-100
<type 'str'> Type de fenêtre_CheckDelete : <type 'str'>
<type 'str'> Type de fenêtre_Nom : <type 'str'> Fenetre 2006-2010
<type 'str'> Type de fenêtre_Intercalaire : <type 'str'> 1
<type 'str'> Liste des fenêtres_Hauteur : <type 'str'> 3.29
<type 'str'> OBJECTID : <type 'str'> 3760
<type 'str'> Liste des fenêtres_Nb vantaux : <type 'str'> 2
<type 'str'> Liste des fenêtres_Façade : <type 'str'> uid-001-AW1
<type 'str'> Type de fenêtre_Cadre : <type 'str'> 7
<type 'str'> Type de fenêtre_vitrage : <type 'str'> 4
<type 'str'> Liste des fenêtres_Part cadre : <type 'str'> 20
<type 'str'> Liste des fenêtres_Nom : <type 'str'> f1
<type 'str'> Liste des fenêtres_Nombre : <type 'str'> 1
<type 'str'> Liste des fenêtres_Ombrage1 : <type 'str'> uid-201
<type 'str'> Liste des fenêtres_Largeur : <type 'str'> 1.55
<type 'str'> Liste des fenêtres_Ombrage2 : <type 'str'>
<type 'str'> Liste des fenêtres_CheckDelete : <type 'str'>
<type 'str'> Liste des fenêtres_Type de fenêtre : <type 'str'> uid-100
<type 'dict'> {'Type de fen\xc3\xaatre_uniqueid': 'uid-100', 'Type de fen\xc3\xaatre_CheckDelete': '', 'Type de fen\xc3\xaatre_Nom': 'Fenetre 2006-2010', 'Type de fen\xc3\xaatre_Intercalaire': '1', 'Liste des fen\xc3\xaatres_Hauteur': '3.29', '\xef\xbb\xbfOBJECTID': '3760', 'Liste des fen\xc3\xaatres_Nb vantaux': '2', 'Liste des fen\xc3\xaatres_Fa\xc3\xa7ade': 'uid-001-AW1', 'Type de fen\xc3\xaatre_Cadre': '7', 'Type de fen\xc3\xaatre_vitrage': '4', 'Liste des fen\xc3\xaatres_Part cadre': '20', 'Liste des fen\xc3\xaatres_Nom': 'f1', 'Liste des fen\xc3\xaatres_Nombre': '1', 'Liste des fen\xc3\xaatres_Ombrage1': 'uid-201', 'Liste des fen\xc3\xaatres_Largeur': '1.55', 'Liste des fen\xc3\xaatres_Ombrage2': '', 'Liste des fen\xc3\xaatres_CheckDelete': '', 'Liste des fen\xc3\xaatres_Type de fen\xc3\xaatre': 'uid-100'}
Traceback (most recent call last):
File "./oba.py", line 120, in <module>
sys.exit(main())
File "./oba.py", line 115, in main
readCSV(f,out)
File "./oba.py", line 37, in readCSV
if row['OBJECTID'] is not '':
KeyError: 'OBJECTID'
If you look at the last line before the stack trace, you can see that although the key and value strings in the first row all print correctly, the dict does not seem to store the keys and values with the proper encoding. Hence the error.
In order to fix this issue, I tried this:
def unicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8') : unicode(value, 'utf-8') for key, value in row.iteritems()}

def readCSV(f, bdgs):
    js=getJSONmap()
    with open(f) as csvfile:
        reader = unicodeDictReader(csvfile, delimiter=';')
        for row in reader:
            for key, val in row.iteritems():
                print type(key), key,':',type(val),val
            print type(row), row
            if row['OBJECTID'] is not '':
which yields this:
<type 'unicode'> Type de fenêtre_Cadre : <type 'unicode'> 7
<type 'unicode'> Liste des fenêtres_Hauteur : <type 'unicode'> 3.29
<type 'unicode'> Type de fenêtre_uniqueid : <type 'unicode'> uid-100
<type 'unicode'> Liste des fenêtres_Nom : <type 'unicode'> f1
<type 'unicode'> OBJECTID : <type 'unicode'> 3760
<type 'unicode'> Type de fenêtre_Intercalaire : <type 'unicode'> 1
<type 'unicode'> Liste des fenêtres_Ombrage1 : <type 'unicode'> uid-201
<type 'unicode'> Liste des fenêtres_Largeur : <type 'unicode'> 1.55
<type 'unicode'> Liste des fenêtres_Part cadre : <type 'unicode'> 20
<type 'unicode'> Liste des fenêtres_Type de fenêtre : <type 'unicode'> uid-100
<type 'unicode'> Type de fenêtre_Nom : <type 'unicode'> Fenetre 2006-2010
<type 'unicode'> Liste des fenêtres_CheckDelete : <type 'unicode'>
<type 'unicode'> Liste des fenêtres_Nb vantaux : <type 'unicode'> 2
<type 'unicode'> Type de fenêtre_CheckDelete : <type 'unicode'>
<type 'unicode'> Type de fenêtre_vitrage : <type 'unicode'> 4
<type 'unicode'> Liste des fenêtres_Façade : <type 'unicode'> uid-001-AW1
<type 'unicode'> Liste des fenêtres_Ombrage2 : <type 'unicode'>
<type 'unicode'> Liste des fenêtres_Nombre : <type 'unicode'> 1
<type 'dict'> {u'Type de fen\xeatre_Cadre': u'7', u'Liste des fen\xeatres_Hauteur': u'3.29', u'Type de fen\xeatre_uniqueid': u'uid-100', u'Liste des fen\xeatres_Nom': u'f1', u'\ufeffOBJECTID': u'3760', u'Type de fen\xeatre_Intercalaire': u'1', u'Liste des fen\xeatres_Ombrage1': u'uid-201', u'Liste des fen\xeatres_Largeur': u'1.55', u'Liste des fen\xeatres_Part cadre': u'20', u'Liste des fen\xeatres_Type de fen\xeatre': u'uid-100', u'Type de fen\xeatre_Nom': u'Fenetre 2006-2010', u'Liste des fen\xeatres_CheckDelete': u'', u'Liste des fen\xeatres_Nb vantaux': u'2', u'Type de fen\xeatre_CheckDelete': u'', u'Type de fen\xeatre_vitrage': u'4', u'Liste des fen\xeatres_Fa\xe7ade': u'uid-001-AW1', u'Liste des fen\xeatres_Ombrage2': u'', u'Liste des fen\xeatres_Nombre': u'1'}
Traceback (most recent call last):
File "./oba.py", line 120, in <module>
sys.exit(main())
File "./oba.py", line 115, in main
readCSV(f,out)
File "./oba.py", line 37, in readCSV
if row['OBJECTID'] is not '':
KeyError: 'OBJECTID'
which now leaves me perplexed as to what is happening behind the scenes with encoding.
Can someone with more insight into encoding explain what is going on here?
Thanks.
EDIT: it is also worth mentioning that I have declared the encoding as utf-8 in the header of the file (i.e. "# -*- coding: utf-8 -*-"), and I am running Python 2.7.6.
Upvotes: 0
Views: 1281
Reputation: 177755
How do I fix this without having to look for "\ufeffOBJECTID" in my dict?
Use utf-8-sig instead of utf-8 for the decode. It automatically removes the BOM code point when decoding a UTF-8-encoded byte string.
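As a minimal sketch of the difference between the two codecs, the snippet below decodes the same bytes both ways and then decodes a row shaped like the one in the question. The sample bytes and the helper name decode_row are made up for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the first bytes of a UTF-8 CSV that was saved with a BOM.
raw_header = b'\xef\xbb\xbfOBJECTID;Liste des fen\xc3\xaatres_Nom'

with_bom = raw_header.decode('utf-8')         # the BOM survives as U+FEFF
without_bom = raw_header.decode('utf-8-sig')  # the BOM is stripped


def decode_row(row, encoding='utf-8-sig'):
    """Decode the byte-string keys and values of one csv.DictReader row.

    decode_row is an illustrative helper, not part of the csv module.
    """
    return dict((key.decode(encoding), value.decode(encoding))
                for key, value in row.items())


# A row shaped like the one in the question (byte strings, as in Python 2).
raw_row = {
    b'\xef\xbb\xbfOBJECTID': b'3760',
    b'Liste des fen\xc3\xaatres_Nom': b'f1',
}
row = decode_row(raw_row)
```

With this, row[u'OBJECTID'] can be looked up directly. utf-8-sig only removes a BOM at the very start of the data, so applying it to every key and value is harmless. On Python 3 the csv module works on text, so there the usual approach would be to open the file with encoding='utf-8-sig' rather than decoding each row.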
Why in my second attempt is python acknowledging that it reads utf-8 (unicode) data in the row but still displays it the wrong way when I print the row as a dict?
Printing a container uses repr() on the items it contains. This is so you can see the actual data in the strings. print the container items directly to see their "pretty" version.
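A small sketch of why the BOM was invisible in the per-key output but visible in the dict output: the key below is a hypothetical stand-in for the first header of the file. The exact repr text differs between Python 2 and Python 3, but the BOM is escaped in both.

```python
# -*- coding: utf-8 -*-
# Hypothetical key: "OBJECTID" preceded by the BOM code point U+FEFF.
key = u'\ufeffOBJECTID'
row = {key: u'3760'}

# Printing the key itself writes the raw characters; U+FEFF has no visible
# glyph, so the output looks like plain "OBJECTID".
pretty = key

# Printing the dict goes through repr(), which escapes the BOM so it shows up.
shown = repr(row)

# The stored key is one character longer than the one being looked up.
extra_chars = len(key) - len(u'OBJECTID')
found = u'OBJECTID' in row
```

Here found is False even though the printed key looks identical, which is exactly the KeyError from the question.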
Is that a problem with printing/storing unicode dicts in python ?
It's not a problem. It's just a display method. The data in the strings is the same.
Is this behavior expected for any container?
Yes.
Also, do not use is to test for an empty string; is compares object identity, not value. Use:
if row['OBJECTID'] != '':
or better, since empty strings are considered false:
if row['OBJECTID']:
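The truthiness test can be sketched as a small filter. The helper name has_object_id and the sample rows are invented for illustration, and using row.get() instead of row['OBJECTID'] is an extra assumption: it treats a missing key like an empty value instead of raising KeyError.

```python
def has_object_id(row):
    """Return True when the row carries a non-empty OBJECTID value."""
    # Empty strings (byte or unicode) and None are all falsy.
    return bool(row.get(u'OBJECTID'))


# Hypothetical rows: one filled in, one empty, one without the column.
rows = [{u'OBJECTID': u'3760'}, {u'OBJECTID': u''}, {}]
kept = [r for r in rows if has_object_id(r)]
```

Because the check relies on truthiness rather than identity, it behaves the same for '' and u'', which an is comparison does not guarantee.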
Upvotes: 3
Reputation: 715
I ran the script with your code and got the following output:
<type 'unicode'> Type de fenêtre_Cadre : <type 'unicode'> 7
<type 'unicode'> Liste des fenêtres_Hauteur : <type 'unicode'> 3.29
<type 'unicode'> Type de fenêtre_uniqueid : <type 'unicode'> uid-100
<type 'unicode'> Liste des fenêtres_Nom : <type 'unicode'> f1
<type 'unicode'> OBJECTID : <type 'unicode'> 3760
Notice that when pasting into this text box, StackOverflow kills the unicode character before OBJECTID.
In answer to your questions, I think this behavior is reasonable, since 'OBJECTID' is actually not inside row (rather, '\ufeffOBJECTID' is).
As for why Python prints the whole row with the escaped unicode characters, that is probably down to how __repr__ of dict is implemented.
If you want to get rid of the unicode characters, I would suggest using unidecode or a package of the sort; then you could refer directly to OBJECTID.
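If only the BOM is in the way, a standard-library alternative (not unidecode) is to strip it from the keys of rows that are already decoded. This is a sketch under that assumption; strip_bom_from_keys and the sample row are hypothetical.

```python
# -*- coding: utf-8 -*-
BOM = u'\ufeff'


def strip_bom_from_keys(row):
    """Return a copy of the row with any leading BOM removed from its keys."""
    return dict((key.lstrip(BOM), value) for key, value in row.items())


# Hypothetical decoded row, shaped like the second attempt in the question.
row = {u'\ufeffOBJECTID': u'3760', u'Liste des fen\xeatres_Nom': u'f1'}
cleaned = strip_bom_from_keys(row)
```

Unlike a transliteration package, this leaves accented characters such as the ê in "fenêtres" untouched, so the other column names keep their original spelling.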
I hope this helps explain it.
Upvotes: 1