bny

Reputation: 285

csv DictReader encoding not correct

I have the following script to read a UTF-8 CSV:

import csv

def readCSV(f, bdgs):
    with open(f) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=';')
        for row in reader:
            for key, val in row.iteritems():
                print type(key), key, ':', type(val), val
            print type(row), row
            if row['OBJECTID'] is not '':
                # do some magic
                pass

which yields this:

processing the following files: ['Fenetre.csv']
<type 'str'> Type de fenêtre_uniqueid : <type 'str'> uid-100
<type 'str'> Type de fenêtre_CheckDelete : <type 'str'> 
<type 'str'> Type de fenêtre_Nom : <type 'str'> Fenetre 2006-2010
<type 'str'> Type de fenêtre_Intercalaire : <type 'str'> 1
<type 'str'> Liste des fenêtres_Hauteur : <type 'str'> 3.29
<type 'str'> OBJECTID : <type 'str'> 3760
<type 'str'> Liste des fenêtres_Nb vantaux : <type 'str'> 2
<type 'str'> Liste des fenêtres_Façade : <type 'str'> uid-001-AW1
<type 'str'> Type de fenêtre_Cadre : <type 'str'> 7
<type 'str'> Type de fenêtre_vitrage : <type 'str'> 4
<type 'str'> Liste des fenêtres_Part cadre : <type 'str'> 20
<type 'str'> Liste des fenêtres_Nom : <type 'str'> f1
<type 'str'> Liste des fenêtres_Nombre : <type 'str'> 1
<type 'str'> Liste des fenêtres_Ombrage1 : <type 'str'> uid-201
<type 'str'> Liste des fenêtres_Largeur : <type 'str'> 1.55
<type 'str'> Liste des fenêtres_Ombrage2 : <type 'str'> 
<type 'str'> Liste des fenêtres_CheckDelete : <type 'str'> 
<type 'str'> Liste des fenêtres_Type de fenêtre : <type 'str'> uid-100
<type 'dict'> {'Type de fen\xc3\xaatre_uniqueid': 'uid-100', 'Type de fen\xc3\xaatre_CheckDelete': '', 'Type de fen\xc3\xaatre_Nom': 'Fenetre 2006-2010', 'Type de fen\xc3\xaatre_Intercalaire': '1', 'Liste des fen\xc3\xaatres_Hauteur': '3.29', '\xef\xbb\xbfOBJECTID': '3760', 'Liste des fen\xc3\xaatres_Nb vantaux': '2', 'Liste des fen\xc3\xaatres_Fa\xc3\xa7ade': 'uid-001-AW1', 'Type de fen\xc3\xaatre_Cadre': '7', 'Type de fen\xc3\xaatre_vitrage': '4', 'Liste des fen\xc3\xaatres_Part cadre': '20', 'Liste des fen\xc3\xaatres_Nom': 'f1', 'Liste des fen\xc3\xaatres_Nombre': '1', 'Liste des fen\xc3\xaatres_Ombrage1': 'uid-201', 'Liste des fen\xc3\xaatres_Largeur': '1.55', 'Liste des fen\xc3\xaatres_Ombrage2': '', 'Liste des fen\xc3\xaatres_CheckDelete': '', 'Liste des fen\xc3\xaatres_Type de fen\xc3\xaatre': 'uid-100'}
Traceback (most recent call last):
  File "./oba.py", line 120, in <module>
    sys.exit(main())
  File "./oba.py", line 115, in main
    readCSV(f,out)
  File "./oba.py", line 37, in readCSV
    if row['OBJECTID'] is not '':
KeyError: 'OBJECTID'

If you look at the last line before the stack trace, you can see that although the key and value strings of the first row look correct when printed individually, the dict does not seem to store the keys and values with the proper encoding. Hence the error.

In order to fix this issue, I tried this:

def unicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8') : unicode(value, 'utf-8') for key, value in row.iteritems()}

def readCSV(f, bdgs):
    js=getJSONmap()
    with open(f) as csvfile:
        reader = unicodeDictReader(csvfile, delimiter=';')
        for row in reader:
            for key, val  in row.iteritems():
                print type(key), key,':',type(val),val
            print type(row), row
            if row['OBJECTID'] is not '':
                # do some magic
                pass

which yields this:

<type 'unicode'> Type de fenêtre_Cadre : <type 'unicode'> 7
<type 'unicode'> Liste des fenêtres_Hauteur : <type 'unicode'> 3.29
<type 'unicode'> Type de fenêtre_uniqueid : <type 'unicode'> uid-100
<type 'unicode'> Liste des fenêtres_Nom : <type 'unicode'> f1
<type 'unicode'> OBJECTID : <type 'unicode'> 3760
<type 'unicode'> Type de fenêtre_Intercalaire : <type 'unicode'> 1
<type 'unicode'> Liste des fenêtres_Ombrage1 : <type 'unicode'> uid-201
<type 'unicode'> Liste des fenêtres_Largeur : <type 'unicode'> 1.55
<type 'unicode'> Liste des fenêtres_Part cadre : <type 'unicode'> 20
<type 'unicode'> Liste des fenêtres_Type de fenêtre : <type 'unicode'> uid-100
<type 'unicode'> Type de fenêtre_Nom : <type 'unicode'> Fenetre 2006-2010
<type 'unicode'> Liste des fenêtres_CheckDelete : <type 'unicode'> 
<type 'unicode'> Liste des fenêtres_Nb vantaux : <type 'unicode'> 2
<type 'unicode'> Type de fenêtre_CheckDelete : <type 'unicode'> 
<type 'unicode'> Type de fenêtre_vitrage : <type 'unicode'> 4
<type 'unicode'> Liste des fenêtres_Façade : <type 'unicode'> uid-001-AW1
<type 'unicode'> Liste des fenêtres_Ombrage2 : <type 'unicode'> 
<type 'unicode'> Liste des fenêtres_Nombre : <type 'unicode'> 1
<type 'dict'> {u'Type de fen\xeatre_Cadre': u'7', u'Liste des fen\xeatres_Hauteur': u'3.29', u'Type de fen\xeatre_uniqueid': u'uid-100', u'Liste des fen\xeatres_Nom': u'f1', u'\ufeffOBJECTID': u'3760', u'Type de fen\xeatre_Intercalaire': u'1', u'Liste des fen\xeatres_Ombrage1': u'uid-201', u'Liste des fen\xeatres_Largeur': u'1.55', u'Liste des fen\xeatres_Part cadre': u'20', u'Liste des fen\xeatres_Type de fen\xeatre': u'uid-100', u'Type de fen\xeatre_Nom': u'Fenetre 2006-2010', u'Liste des fen\xeatres_CheckDelete': u'', u'Liste des fen\xeatres_Nb vantaux': u'2', u'Type de fen\xeatre_CheckDelete': u'', u'Type de fen\xeatre_vitrage': u'4', u'Liste des fen\xeatres_Fa\xe7ade': u'uid-001-AW1', u'Liste des fen\xeatres_Ombrage2': u'', u'Liste des fen\xeatres_Nombre': u'1'}
Traceback (most recent call last):
  File "./oba.py", line 120, in <module>
    sys.exit(main())
  File "./oba.py", line 115, in main
    readCSV(f,out)
  File "./oba.py", line 37, in readCSV
    if row['OBJECTID'] is not '':
KeyError: 'OBJECTID'

This leaves me perplexed about what is happening behind the scenes with the encoding.

Can someone with more insight into encodings explain what is happening here?

Thanks.

EDIT: it is also worth mentioning that I have declared the encoding as UTF-8 in the header of the script (i.e. "# -*- coding: utf-8 -*-"), and that I am running Python 2.7.6.

Upvotes: 0

Views: 1281

Answers (2)

Mark Tolonen

Reputation: 177755

How do I fix this without having to look for "\ufeffOBJECTID" in my dict?

Use utf-8-sig instead of utf-8 for the decode. It automatically removes the BOM codepoint when decoding a UTF-8-encoded byte string.
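As a minimal sketch of that advice: the helper below decodes the byte-string keys of a row with utf-8-sig, so a leading BOM on the first header disappears. The name decode_row and the sample row are hypothetical, chosen only for illustration; the row mimics what csv.DictReader returns on Python 2 when the file starts with a BOM.

```python
# -*- coding: utf-8 -*-

def decode_row(raw_row):
    """Decode a dict of UTF-8 byte strings into a dict of text strings.

    Keys are decoded with 'utf-8-sig', which drops a leading BOM if one
    is present; values are decoded as plain UTF-8.
    (decode_row is a hypothetical helper, not part of the csv module.)
    """
    return {
        key.decode('utf-8-sig'): value.decode('utf-8')
        for key, value in raw_row.items()
    }


# Sample row shaped like the one in the question: the first header
# still carries the three BOM bytes \xef\xbb\xbf.
raw_row = {
    b'\xef\xbb\xbfOBJECTID': b'3760',
    b'Liste des fen\xc3\xaatres_Nom': b'f1',
}

decoded = decode_row(raw_row)
print(u'OBJECTID' in decoded)   # True: the BOM is gone from the key
print(decoded[u'OBJECTID'])     # 3760
```

In the question's unicodeDictReader, the equivalent change would be to pass 'utf-8-sig' instead of 'utf-8' when converting the key.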

Why, in my second attempt, does Python acknowledge that it reads UTF-8 (unicode) data in the row but still display it the wrong way when I print the row as a dict?

Printing a container applies repr() to each of its items, so that you can see the actual data in the strings. Print the container items directly to see their "pretty" version.
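A small illustration of the difference, using a made-up one-column row: the key holds the BOM code point followed by OBJECTID, which is invisible when the key is printed on its own but shows up as an escape in the repr() form that dict printing uses.

```python
# -*- coding: utf-8 -*-

# Hypothetical row: the header was read from a file that starts with a BOM.
key = u'\ufeffOBJECTID'
row = {key: u'3760'}

# repr() escapes the invisible BOM, so it becomes visible as \ufeff.
print(repr(key))

# Printing a dict applies repr() to every key and value.
print(row)

# The string data is the same either way: eight letters plus the BOM.
print(len(key))             # 9
print(u'OBJECTID' in row)   # False: the stored key starts with the BOM
print(key in row)           # True
```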

Is that a problem with printing/storing unicode dicts in Python?

It's not a problem. It's just a display method. The data in the strings is the same.

Is this behavior expected for any container?

Yes.

Also, do not use the is operator to test for an empty string: it compares object identity, not value. Use:

if row['OBJECTID'] != '':

or better, since empty strings are considered false:

if row['OBJECTID']:
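A short sketch of the truthiness test wrapped in a function. The name has_object_id is hypothetical, and the use of .get() is an added assumption: it treats a missing column like an empty one instead of raising KeyError, which may or may not be what you want.

```python
def has_object_id(row):
    """Return True when the row has a non-empty OBJECTID value.

    Hypothetical helper; .get() returns None for a missing key, and
    both None and '' are falsy, so either case yields False.
    """
    return bool(row.get('OBJECTID'))


print(has_object_id({'OBJECTID': '3760'}))  # True
print(has_object_id({'OBJECTID': ''}))      # False
print(has_object_id({}))                    # False
```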

Upvotes: 3

wilfo

Reputation: 715

I ran the script with your code and got the following output:

<type 'unicode'> Type de fenêtre_Cadre : <type 'unicode'> 7
<type 'unicode'> Liste des fenêtres_Hauteur : <type 'unicode'> 3.29
<type 'unicode'> Type de fenêtre_uniqueid : <type 'unicode'> uid-100
<type 'unicode'> Liste des fenêtres_Nom : <type 'unicode'> f1
<type 'unicode'>  OBJECTID : <type 'unicode'> 3760

Notice that when pasting into this text box, Stack Overflow strips the invisible Unicode character before OBJECTID.

In answer to your questions, I think this behavior is reasonable, since 'OBJECTID' is actually not in row (rather, '\ufeffOBJECTID' is).

The reason Python prints the whole row with escaped Unicode characters probably comes down to how __repr__ of dict is implemented.

If you want to get rid of the Unicode characters, I would suggest using unidecode or a similar package; you could then refer directly to OBJECTID.

I hope this helps explain it.

Upvotes: 1
