csv dictReader encoding not correct

Question

I have the following script to read a UTF-8 CSV:

import csv

def readCSV(f, bdgs):
    with open(f) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=';')
        for row in reader:
            for key, val in row.iteritems():
                print type(key), key, ':', type(val), val
            print type(row), row
            if row['OBJECTID'] is not '':
                pass  # do some magic

which yields this:

processing the following files: ['Fenetre.csv']
 Type de fenêtre_uniqueid :  uid-100
 Type de fenêtre_CheckDelete :  
 Type de fenêtre_Nom :  Fenetre 2006-2010
 Type de fenêtre_Intercalaire :  1
 Liste des fenêtres_Hauteur :  3.29
 OBJECTID :  3760
 Liste des fenêtres_Nb vantaux :  2
 Liste des fenêtres_Façade :  uid-001-AW1
 Type de fenêtre_Cadre :  7
 Type de fenêtre_vitrage :  4
 Liste des fenêtres_Part cadre :  20
 Liste des fenêtres_Nom :  f1
 Liste des fenêtres_Nombre :  1
 Liste des fenêtres_Ombrage1 :  uid-201
 Liste des fenêtres_Largeur :  1.55
 Liste des fenêtres_Ombrage2 :  
 Liste des fenêtres_CheckDelete :  
 Liste des fenêtres_Type de fenêtre :  uid-100
 {'Type de fen\xc3\xaatre_uniqueid': 'uid-100', 'Type de fen\xc3\xaatre_CheckDelete': '', 'Type de fen\xc3\xaatre_Nom': 'Fenetre 2006-2010', 'Type de fen\xc3\xaatre_Intercalaire': '1', 'Liste des fen\xc3\xaatres_Hauteur': '3.29', '\xef\xbb\xbfOBJECTID': '3760', 'Liste des fen\xc3\xaatres_Nb vantaux': '2', 'Liste des fen\xc3\xaatres_Fa\xc3\xa7ade': 'uid-001-AW1', 'Type de fen\xc3\xaatre_Cadre': '7', 'Type de fen\xc3\xaatre_vitrage': '4', 'Liste des fen\xc3\xaatres_Part cadre': '20', 'Liste des fen\xc3\xaatres_Nom': 'f1', 'Liste des fen\xc3\xaatres_Nombre': '1', 'Liste des fen\xc3\xaatres_Ombrage1': 'uid-201', 'Liste des fen\xc3\xaatres_Largeur': '1.55', 'Liste des fen\xc3\xaatres_Ombrage2': '', 'Liste des fen\xc3\xaatres_CheckDelete': '', 'Liste des fen\xc3\xaatres_Type de fen\xc3\xaatre': 'uid-100'}
Traceback (most recent call last):
  File "./oba.py", line 120, in <module>
    sys.exit(main())
  File "./oba.py", line 115, in main
    readCSV(f,out)
  File "./oba.py", line 37, in readCSV
    if row['OBJECTID'] is not '':
KeyError: 'OBJECTID'

If you look at the last line before the stack trace, you can see that although the key and value strings of the first row print correctly, the dict does not seem to store the keys and values with the proper encoding, hence the error.

In order to fix this issue, I tried this:

def unicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8') : unicode(value, 'utf-8') for key, value in row.iteritems()}

def readCSV(f, bdgs):
    js=getJSONmap()
    with open(f) as csvfile:
        reader = unicodeDictReader(csvfile, delimiter=';')
        for row in reader:
            for key, val  in row.iteritems():
                print type(key), key,':',type(val),val
            print type(row), row
            if row['OBJECTID'] is not '':
                pass  # do some magic

which yields this:

 Type de fenêtre_Cadre :  7
 Liste des fenêtres_Hauteur :  3.29
 Type de fenêtre_uniqueid :  uid-100
 Liste des fenêtres_Nom :  f1
 OBJECTID :  3760
 Type de fenêtre_Intercalaire :  1
 Liste des fenêtres_Ombrage1 :  uid-201
 Liste des fenêtres_Largeur :  1.55
 Liste des fenêtres_Part cadre :  20
 Liste des fenêtres_Type de fenêtre :  uid-100
 Type de fenêtre_Nom :  Fenetre 2006-2010
 Liste des fenêtres_CheckDelete :  
 Liste des fenêtres_Nb vantaux :  2
 Type de fenêtre_CheckDelete :  
 Type de fenêtre_vitrage :  4
 Liste des fenêtres_Façade :  uid-001-AW1
 Liste des fenêtres_Ombrage2 :  
 Liste des fenêtres_Nombre :  1
 {u'Type de fen\xeatre_Cadre': u'7', u'Liste des fen\xeatres_Hauteur': u'3.29', u'Type de fen\xeatre_uniqueid': u'uid-100', u'Liste des fen\xeatres_Nom': u'f1', u'\ufeffOBJECTID': u'3760', u'Type de fen\xeatre_Intercalaire': u'1', u'Liste des fen\xeatres_Ombrage1': u'uid-201', u'Liste des fen\xeatres_Largeur': u'1.55', u'Liste des fen\xeatres_Part cadre': u'20', u'Liste des fen\xeatres_Type de fen\xeatre': u'uid-100', u'Type de fen\xeatre_Nom': u'Fenetre 2006-2010', u'Liste des fen\xeatres_CheckDelete': u'', u'Liste des fen\xeatres_Nb vantaux': u'2', u'Type de fen\xeatre_CheckDelete': u'', u'Type de fen\xeatre_vitrage': u'4', u'Liste des fen\xeatres_Fa\xe7ade': u'uid-001-AW1', u'Liste des fen\xeatres_Ombrage2': u'', u'Liste des fen\xeatres_Nombre': u'1'}
Traceback (most recent call last):
  File "./oba.py", line 120, in <module>
    sys.exit(main())
  File "./oba.py", line 115, in main
    readCSV(f,out)
  File "./oba.py", line 37, in readCSV
    if row['OBJECTID'] is not '':
KeyError: 'OBJECTID'

This leaves me perplexed as to what is happening behind the scenes with the encoding:

How do I fix this without having to look for "\ufeffOBJECTID" in my dict?
Why, in my second attempt, does Python acknowledge that it reads UTF-8 (unicode) data in the row but still display it the wrong way when I print the row as a dict?
Is that a problem with printing/storing unicode dicts in Python? Is this behavior expected for any container?

Can someone with more insight into encoding explain what is happening behind the scenes here?

Thanks.

EDIT: It is also worth mentioning that I have declared the encoding as UTF-8 in the header of the script (i.e. "# -*- coding: utf-8 -*-"), and I am running Python 2.7.6.

Mark Tolonen · Accepted Answer

How do I fix this without having to look for "\ufeffOBJECTID" in my dict?

Use utf-8-sig instead of utf-8 for the decode. It automatically removes the BOM code point (U+FEFF, stored as the bytes \xef\xbb\xbf at the start of the file) when decoding a UTF-8-encoded byte string.
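
As a minimal sketch of the difference between the two codecs (written for Python 3, which is an assumption, since the question uses Python 2.7.6; the sample bytes and the helper name first_row are made up for illustration):

```python
import csv
import io

# Hypothetical sample: a UTF-8 CSV that starts with a BOM, like the file in the question.
raw = b'\xef\xbb\xbfOBJECTID;Nom\r\n3760;f1\r\n'

def first_row(data, encoding):
    """Decode the bytes with the given codec and return the first CSV row as a dict."""
    text = io.StringIO(data.decode(encoding), newline='')
    return next(csv.DictReader(text, delimiter=';'))

row_plain = first_row(raw, 'utf-8')      # BOM stays attached to the first header
row_sig = first_row(raw, 'utf-8-sig')    # BOM is stripped during decoding

print('OBJECTID' in row_plain)
print('OBJECTID' in row_sig)
```

With a real file in Python 3, the same effect should come from passing encoding='utf-8-sig' to open(). In the Python 2 wrapper from the question, the equivalent change would presumably be to pass 'utf-8-sig' to unicode() instead of 'utf-8'.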

Why in my second attempt is python acknowledging that it reads utf-8 (unicode) data in the row but still displays it the wrong way when I print the row as a dict?

Printing a container uses repr() on the items it contains, so that you can see the actual data in the strings. Print the container items directly to see their "pretty" version.

Is that a problem with printing/storing unicode dicts in python ?

It's not a problem. It's just a display method. The data in the strings is the same.

Is this behavior expected for any container?

Yes.
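
The display behaviour can be checked with a small sketch. The one-entry row below is a made-up stand-in for the real data; the comparisons only assume that a dict or list builds its printed form from repr() of its items:

```python
# Hypothetical one-entry row, with the BOM still attached to the key.
key = u'\ufeffOBJECTID'
value = u'3760'
row = {key: value}

# The dict's string form is assembled from repr() of each key and value ...
dict_text = str(row)
expected = '{' + repr(key) + ': ' + repr(value) + '}'
print(dict_text == expected)

# ... and the same holds for other containers, such as lists.
list_text = str([value])
print(list_text == '[' + repr(value) + ']')

# The stored string itself is untouched: it is still nine code points long.
stored_key = list(row)[0]
print(len(stored_key))
```

Only the printed form uses escapes; the key held in the dict is the same string that was put in.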

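Also, do not use the "is" operator to test for an empty string. It checks object identity rather than equality, so the result depends on interpreter internals. Use: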

if row['OBJECTID'] != '':

or better, since empty strings are considered false:

if row['OBJECTID']:
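
A short sketch of the truthiness test, for illustration only. The helper name has_object_id is hypothetical, and using row.get() to tolerate a missing key is an addition that goes beyond the answer:

```python
def has_object_id(row):
    """Return True when the row has a non-empty OBJECTID value."""
    # row.get() returns None for a missing key; None and '' are both falsy.
    return bool(row.get('OBJECTID'))

# Two equal strings built in different ways.
built = ''.join(['OBJECT', 'ID'])
literal = 'OBJECTID'

# == compares contents, which is what the test needs.
same_value = built == literal
# Whether "built is literal" holds depends on interpreter internals,
# so it is deliberately not relied on here.

print(same_value)
print(has_object_id({'OBJECTID': '3760'}))
print(has_object_id({'OBJECTID': ''}))
```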
