lukecampbell

Reputation: 15256

Python force dict entries to be utf-8

I spent the better part of an afternoon trying to patch dictionary objects so that they hold utf-8 encoded byte strings instead of unicode objects. I am looking for the fastest way to extend a dictionary object and ensure that its entries, both keys and values, are utf-8.

Here is what I have come up with. It does the job, but I'm wondering what improvements could be made.

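class UTF8Dict(dict):
    """dict that stores unicode keys and values as UTF-8 byte strings (Python 2)."""

    def __init__(self, *args, **kwargs):
        # Encode the initial contents recursively before storing them.
        d = dict(*args, **kwargs)
        d = _decode_dict(d)
        super(UTF8Dict, self).__init__(d)

    def __setitem__(self, key, value):
        # Only top-level unicode keys and values are encoded here;
        # nested lists and dicts are stored unchanged.
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        return super(UTF8Dict, self).__setitem__(key, value)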

# Despite the name, this encodes unicode items to UTF-8 byte strings.
def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

# Despite the name, this encodes unicode keys and values to UTF-8 byte strings.
def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

Suggestions that improve any of the following would be very helpful:

Upvotes: 3

Views: 8472

Answers (1)

Edward Loper

Reputation: 15944

I agree with the comments that say that this may be misguided. That said, here are some holes in your current scheme:

  1. d.setdefault can be used to add unicode objects to your dict, because the built-in method does not go through your overridden __setitem__:

    >>> d = UTF8Dict()
    >>> d.setdefault(u'x', u'y')
    
  2. d.update can be used to add unicode objects to your dict, for the same reason (it bypasses your overridden __setitem__):

    >>> d = UTF8Dict()
    >>> d.update({u'x': u'y'})
    
  3. The list values contained in the dict can be modified afterwards to include unicode objects, using any standard list operation. E.g.:

    >>> d = UTF8Dict(x=[])
    >>> d['x'].append(u'x')
    

Why do you want to ensure that your data structure contains only utf-8 strings?
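
If you do decide you need this, here is a minimal sketch of one way to close the three holes listed above. It is not a definitive implementation, and it makes some assumptions that are mine rather than part of your code: the helper name `_to_utf8` and the `text_type` alias are hypothetical, nested lists are frozen into tuples so they cannot be appended to later, and nested dicts are wrapped in `UTF8Dict` themselves. The `try`/`except` at the top only exists so the same code also runs on Python 3, where the encoded values are `bytes`.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3: str is the unicode text type


def _to_utf8(obj):
    """Recursively encode text to UTF-8 byte strings.

    Assumption: lists are converted to tuples so that nobody can append
    unicode objects to them after they have been stored.
    """
    if isinstance(obj, text_type):
        return obj.encode('utf-8')
    if isinstance(obj, (list, tuple)):
        return tuple(_to_utf8(item) for item in obj)
    if isinstance(obj, dict):
        return UTF8Dict(obj)
    return obj


class UTF8Dict(dict):
    def __init__(self, *args, **kwargs):
        super(UTF8Dict, self).__init__()
        # Route the initial contents through update(), and so through
        # __setitem__, instead of letting dict.__init__ store them directly.
        self.update(*args, **kwargs)

    def __setitem__(self, key, value):
        super(UTF8Dict, self).__setitem__(_to_utf8(key), _to_utf8(value))

    def setdefault(self, key, default=None):
        key = _to_utf8(key)
        if key not in self:
            self[key] = default  # goes through __setitem__
        return self[key]

    def update(self, *args, **kwargs):
        # dict() accepts the same arguments as dict.update().
        for key, value in dict(*args, **kwargs).items():
            self[key] = value
```

With this version, `d.setdefault(u'x', u'y')` and `d.update({u'x': u'y'})` both store encoded byte strings, and `UTF8Dict(x=[])['x'].append(u'x')` raises an `AttributeError` because the stored value is a tuple. This is still not exhaustive: for example, `copy()` returns a plain `dict`, and any other mutable object stored as a value can still be changed behind the dict's back.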

Upvotes: 4
