Reputation: 15256
I spent the better part of an afternoon trying to patch dictionary objects so that they hold utf-8 encoded strings instead of unicode. I am trying to find the fastest, best-performing way to extend a dictionary object and ensure that its entries, both keys and values, are utf-8 encoded.
Here is what I have come up with. It does the job, but I'm wondering what improvements could be made.
class UTF8Dict(dict):
    def __init__(self, *args, **kwargs):
        d = dict(*args, **kwargs)
        d = _decode_dict(d)
        super(UTF8Dict, self).__init__(d)

    def __setitem__(self, key, value):
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        return super(UTF8Dict, self).__setitem__(key, value)

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv
Suggestions that improve any of the following would be very helpful:
Upvotes: 3
Views: 8472
Reputation: 15944
I agree with the comments that say that this may be misguided. That said, here are some holes in your current scheme:
d.setdefault can be used to add unicode objects to your dict:
>>> d = UTF8Dict()
>>> d.setdefault(u'x', u'y')
d.update can be used to add unicode objects to your dict:
>>> d = UTF8Dict()
>>> d.update({u'x': u'y'})
The list values contained in a dict can be modified to include unicode objects using any standard list operation, e.g.:
>>> d = UTF8Dict(x=[])
>>> d['x'].append(u'x')
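One way to close these three holes might look like the sketch below. It is not the asker's code: SafeUTF8Dict and _to_utf8 are hypothetical names, and converting lists to tuples is an assumed design choice made here so that stored sequences cannot be mutated afterwards. The sketch routes the constructor, setdefault and update through __setitem__, and it is written to run on both Python 2 and Python 3 (where the encoded values are bytes). Other entry points, such as calling dict.__setitem__ directly, may still bypass it.

```python
# Hypothetical sketch: SafeUTF8Dict and _to_utf8 are illustrative names,
# not part of the code in the question.
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


def _to_utf8(obj):
    """Recursively encode text to UTF-8 inside dicts, lists and tuples."""
    if isinstance(obj, text_type):
        return obj.encode("utf-8")
    if isinstance(obj, dict):
        # Nested dicts get the same protection as the outer one.
        return SafeUTF8Dict(obj)
    if isinstance(obj, (list, tuple)):
        # Assumption: lists become tuples so they cannot be mutated
        # after insertion (this closes the d['x'].append hole).
        return tuple(_to_utf8(item) for item in obj)
    return obj


class SafeUTF8Dict(dict):
    def __init__(self, *args, **kwargs):
        super(SafeUTF8Dict, self).__init__()
        # Go through update() so every entry passes __setitem__.
        self.update(*args, **kwargs)

    def __setitem__(self, key, value):
        super(SafeUTF8Dict, self).__setitem__(_to_utf8(key), _to_utf8(value))

    def setdefault(self, key, default=None):
        key = _to_utf8(key)
        if key not in self:
            self[key] = default
        return self[key]

    def update(self, *args, **kwargs):
        # dict() accepts the same arguments as dict.update().
        for key, value in dict(*args, **kwargs).items():
            self[key] = value
```

With this sketch, d.setdefault(u'x', u'y') and d.update({u'x': u'y'}) both store encoded keys and values. The trade-off is that list values come back as tuples, which may or may not be acceptable for your use case.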
Why do you want to ensure that your data structure contains only utf-8 strings?
Upvotes: 4