Reputation: 81
I am using simplejson to deserialize a JSON string into Python objects. I have a custom object_hook that takes care of turning the JSON back into my domain objects.
The problem is that when the JSON string is huge (i.e. the server returns around 800K domain objects as a single JSON string), my Python deserializer takes almost 10 minutes to process it.
I drilled down a bit further, and it looks like simplejson itself is not doing much work; it delegates everything to the object_hook. I tried optimizing my object_hook, but that barely helped (I gained about 1 minute at most).
My question is: is there another standard framework that is optimized for huge data sets, or is there a way to use the framework's own capabilities rather than doing everything at the object_hook level?
I see that without object_hook the framework returns just a list of dictionaries, not a list of domain objects.
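For illustration, here is a minimal sketch of that difference. It uses the standard-library json module, whose loads() takes an object_hook argument in the same way simplejson does; the Counter class and the hook function are hypothetical stand-ins, not the real domain code.

```python
import json


class Counter:
    """Hypothetical stand-in for a domain class."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def hook(dct):
    # Called once for every decoded JSON object, innermost objects first.
    if dct.get("@CLASS") == "sample.counter":
        fields = {k: v for k, v in dct.items() if not k.startswith("@")}
        return Counter(**fields)
    return dct


payload = '[{"@CLASS": "sample.counter", "val1": "ABC", "val2": 1131}]'

plain = json.loads(payload)                     # list of plain dicts
hooked = json.loads(payload, object_hook=hook)  # list of Counter instances
```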
Any pointers here will be useful.
FYI I am using simplejson version 3.7.2
Here is my sample _object_hook:
def _object_hook(dct):
    if '@CLASS' in dct:  # server sends domain objects with this @CLASS
        clsname = dct['@CLASS']
        # This is like Class.forName (it imports the module and returns the class)
        cls = get_class(clsname)
        # As my server is in Java, I convert the attribute names to Python naming conventions.
        dct = dict((convert_java_name_to_python(k), dct[k]) for k in dct.keys())
        if cls != None:
            obj_key = None
            if "@uuid" in dct:
                obj_key = dct["@uuid"]
                del(dct["@uuid"])
            else:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)
            # This I found to be the most time-consuming step. In my domain object's
            # __init__ method I set all attributes based on the kwargs passed.
            obj = cls(**dct)
            if obj_key is not None:
                # I keep all uuids along with their objects in the shared_objs dictionary.
                # shared_objs is used later to replace references.
                shared_objs[obj_key] = obj
        else:
            warning("class not found: " + clsname)
            obj = dct
        return obj
    else:
        return dct
A Sample response:
{"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-9b1778a99c82","val1":"ABC","val2":1131,"val3":1754095,"value4":{"@CLASS":"sample.nestedClass","@UUID":"f7bb298c-fd0b-4d87-bed8-74d5eb1d6517","id":1754095,"name":"XYZ","abbreviation":"ABC"}}
I have many levels of nesting, and the server returns more than 800K records.
Upvotes: 8
Views: 1569
Reputation: 78546
I don't know of any framework that offers what you are looking for out of the box, but you can apply a few optimizations to the way your class instances are set up.
Since unpacking the dictionary into keyword arguments and assigning them to instance attributes takes the bulk of the time, consider passing dct directly to your class's __init__ and using it as the instance dictionary (self.__dict__):
Trial 1
In [1]: data = {"name": "yolanda", "age": 4}
In [2]: class Person:
   ...:     def __init__(self, name, age):
   ...:         self.name = name
   ...:         self.age = age
   ...:
In [3]: %%timeit
   ...: Person(**data)
   ...:
1000000 loops, best of 3: 926 ns per loop
Trial 2
In [4]: data = {"name": "yolanda", "age": 4}
In [5]: class Person2:
   ....:     def __init__(self, data):
   ....:         self.__dict__ = data
   ....:
In [6]: %%timeit
   ....: Person2(data)
   ....:
1000000 loops, best of 3: 541 ns per loop
You need not worry about self.__dict__ being modified through another reference, since the reference to dct is lost before _object_hook returns.
This will of course mean changing how your __init__ is set up, with the attributes of your class depending strictly on the items in dct. It's up to you.
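Put together, the hook could look roughly like the sketch below. This is only an illustration under assumptions: DomainObject, Counter and the CLASSES registry are hypothetical (the registry stands in for get_class), the Java-to-Python name conversion and logging are left out, and the "@UUID" key is taken from the sample response. It uses the standard-library json module, which accepts object_hook in the same way as simplejson.

```python
import json

shared_objs = {}


class DomainObject:
    """Hypothetical base class: adopts the decoded dict as its attribute dict."""

    def __init__(self, data):
        self.__dict__ = data


class Counter(DomainObject):
    pass


# Hypothetical registry standing in for get_class().
CLASSES = {"sample.counter": Counter}


def _object_hook(dct):
    clsname = dct.get("@CLASS")
    if clsname is None:
        return dct
    cls = CLASSES.get(clsname)
    if cls is None:
        return dct  # unknown class: leave the plain dict untouched
    del dct["@CLASS"]
    obj_key = dct.pop("@UUID", None)
    obj = cls(dct)  # no keyword unpacking, no per-attribute assignment
    if obj_key is not None:
        shared_objs[obj_key] = obj
    return obj


payload = (
    '{"@CLASS": "sample.counter", "@UUID": "u1", "val2": 1131, '
    '"value4": {"@CLASS": "sample.nestedClass", "id": 5}}'
)
result = json.loads(payload, object_hook=_object_hook)
```

Here the nested object stays a plain dict because its class is not in the registry, while the outer object becomes a Counter whose attributes are exactly the remaining keys of dct.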
You may also replace cls != None with cls is not None (there is only one None object, so an identity check is more Pythonic):
Trial 1
In [38]: cls = 5
In [39]: %%timeit
   ....: cls != None
   ....:
10000000 loops, best of 3: 85.8 ns per loop
Trial 2
In [40]: %%timeit
   ....: cls is not None
   ....:
10000000 loops, best of 3: 57.8 ns per loop
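Besides speed, the identity check is also more robust, because != goes through the object's own comparison methods while is not cannot be overridden. A small sketch, using a deliberately contrived and hypothetical AlwaysEqual class:

```python
class AlwaysEqual:
    """Hypothetical class whose comparisons claim equality with everything."""

    def __eq__(self, other):
        return True

    def __ne__(self, other):
        return False


obj = AlwaysEqual()

by_equality = obj != None      # dispatches to AlwaysEqual.__ne__
by_identity = obj is not None  # compares object identity only
```

With this class, by_equality is False even though obj is clearly not None, whereas by_identity is True.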
You can also replace these two lines:
obj_key = dct["@uuid"]
del(dct["@uuid"])
with a single one:
obj_key = dct.pop('@uuid')  # Not an optimization; it does the same as the two lines above
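dict.pop also accepts a default value, which would let the membership test and the deletion collapse into one call. A minimal sketch with made-up data:

```python
dct = {"@uuid": "86f26a0a", "val1": "ABC"}

obj_key = dct.pop("@uuid", None)  # returns the value and removes the key
missing = dct.pop("@uuid", None)  # key is gone now: returns the default, no KeyError
```

After the first call obj_key holds the uuid and dct no longer contains "@uuid"; a check such as obj_key is None can then decide whether to log the missing uuid.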
On a scale of 800K domain objects, these changes should save some time by letting the object_hook create your objects more quickly.
Upvotes: 6