Reputation: 21501
I tested three different protocols with pickle: 0, 1, and 2. In my test I dumped and loaded one dict of about 270,000 (int, int) pairs and one set of about 560,000 ints.
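(The fetch functions and the database contents aren't shown below, so for anyone trying to reproduce the timings, synthetic data of roughly the same shape can stand in. The sizes here just match the ones described above; the values are arbitrary.)

```python
import random

# A dict of ~270,000 (int, int) pairs, standing in for fetch_sku2spu_dict().
sku2spu_dict = {i: random.randint(0, 10**6) for i in range(270000)}

# A set of ~560,000 ints, standing in for fetch_valid_pids().
pid_set = set(range(560000))
```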
Following is my testing code (you can safely skip the two fetch functions, which I used to fetch data from a database):
import pickle
import time

protocol = 0  # Tested 0, 1, and 2
print 'Protocol:', protocol
t0 = time.time()
sku2spu_dict = fetch_sku2spu_dict()
pid_set = fetch_valid_pids()
t1 = time.time()
print 'Time in sql:', t1 - t0
# Binary mode matters: protocols 1 and 2 produce binary output.
pickle.dump(sku2spu_dict, open('sku.pickle_dict', 'wb'), protocol)
pickle.dump(pid_set, open('pid.pickle_set', 'wb'), protocol)
t2 = time.time()
print 'Time in dump:', t2 - t1
sku2spu_dict = pickle.load(open('sku.pickle_dict', 'rb'))
pid_set = pickle.load(open('pid.pickle_set', 'rb'))
t3 = time.time()
print 'Time in load:', t3 - t2
And here is the time spent by each:
Protocol: 0
Time in dump: 31.3491470814
Time in load: 29.8991980553
Protocol: 1
Time in dump: 32.3191611767
Time in load: 20.6666529179
Protocol: 2
Time in dump: 94.2163629532
Time in load: 42.7647490501
To my great surprise, protocol 2 is MUCH worse than 0 and 1.
However, the dumped file size is the smallest with protocol 2, which is about half of protocol 0 and 1.
In the documentation, it says:
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.
And for the definition of new-style classes, it says:
Any class which inherits from object. This includes all built-in types like list and dict.
So I expected protocol 2 to be faster in dumping and loading objects.
Anyone know why?
UPDATE:
Problem solved after replacing pickle with cPickle. Now load and dump take 5 and 3 seconds with protocol 2, whereas protocols 0 and 1 take more than 10 seconds.
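For reference, the usual idiom on Python 2 is to try cPickle first and fall back to the pure-Python module; on Python 3 this is unnecessary because pickle automatically uses the C accelerator (_pickle) when available. A minimal sketch (the file name is just illustrative):

```python
# Prefer the C implementation on Python 2; on Python 3 plain pickle
# already wraps the C accelerator, so the fallback is a no-op there.
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

data = {i: i * 2 for i in range(1000)}

# Binary mode and an explicit protocol: protocols >= 1 emit binary data.
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, 2)
with open('data.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == data)
```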
Upvotes: 1
Views: 2892
Reputation: 101969
When the documentation talks about "new-style classes" it (probably) refers to user-defined new-style classes. And if you do a simple benchmark with them, you can see that protocol 2 is twice as fast as protocol 0 at dumping them:
>>> import cPickle
>>> import timeit
>>> class MyObject(object):
...     def __init__(self, val):
...         self.val = val
...     def method(self):
...         print self.val
...
>>> timeit.timeit('cPickle.dumps(MyObject(100), 0)', 'from __main__ import cPickle, MyObject')
17.654622077941895
>>> timeit.timeit('cPickle.dumps(MyObject(100), 1)', 'from __main__ import cPickle, MyObject')
14.536609172821045
>>> timeit.timeit('cPickle.dumps(MyObject(100), 2)', 'from __main__ import cPickle, MyObject')
8.885567903518677
Loading also gets a roughly 2x speed-up:
>>> dumped = cPickle.dumps(MyObject(100), 0)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
4.6161839962005615
>>> dumped = cPickle.dumps(MyObject(100), 1)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
4.351701021194458
>>> dumped = cPickle.dumps(MyObject(100), 2)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
2.3936450481414795
In your particular case it might be the opposite, but we can't say anything without the code that defines fetch_sku2spu_dict etc. The only thing I can assume is that the returned value is a dict, in which case protocol 2 is about 6 times faster to dump:
>>> mydict = dict(zip(range(100), range(100)))
>>> timeit.timeit('cPickle.dumps(mydict, 0)', 'from __main__ import cPickle, mydict')
46.335021018981934
>>> timeit.timeit('cPickle.dumps(mydict, 1)', 'from __main__ import cPickle, mydict')
7.913743019104004
>>> timeit.timeit('cPickle.dumps(mydict, 2)', 'from __main__ import cPickle, mydict')
7.798863172531128
And it's about 2.5 times faster at loading:
>>> dumped = cPickle.dumps(mydict, 0)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
32.81050395965576
>>> dumped = cPickle.dumps(mydict, 1)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
13.997781038284302
>>> dumped = cPickle.dumps(mydict, 2)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
14.006750106811523
On the other hand, when using the pure-Python version of the module, I found that:
>>> mydict = dict(zip(range(100), range(100)))
>>> timeit.timeit('pickle.dumps(mydict,0)', 'from __main__ import pickle, mydict', number=10000)
2.9552500247955322
>>> timeit.timeit('pickle.dumps(mydict,1)', 'from __main__ import pickle, mydict', number=10000)
3.831756830215454
>>> timeit.timeit('pickle.dumps(mydict,2)', 'from __main__ import pickle, mydict', number=10000)
3.842888116836548
So it seems that dumping built-in objects with protocols 1 and 2 is slower than with protocol 0 in the pure-Python version. But when loading, protocol 0 is again the slowest of the three:
>>> dumped = pickle.dumps(mydict, 0)
>>> timeit.timeit('pickle.loads(dumped)', 'from __main__ import pickle, dumped', number=10000)
2.988792896270752
>>> dumped = pickle.dumps(mydict, 1)
>>> timeit.timeit('pickle.loads(dumped)', 'from __main__ import pickle, dumped', number=10000)
1.2793281078338623
>>> dumped = pickle.dumps(mydict, 2)
>>> timeit.timeit('pickle.loads(dumped)', 'from __main__ import pickle, dumped', number=10000)
1.5425071716308594
As these mini-benchmarks show, the time taken to pickle depends on a number of factors, from the type of object you are pickling to which version of the pickle module you use. Without further information we won't be able to explain why protocol 2 is so much slower in your case.
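To see size and speed together, a tiny harness like the one below can be run on any machine; absolute numbers will vary by machine and interpreter, but the text-based protocol 0 should always produce the largest output for a dict of small ints. (This is a sketch using only the standard pickle and timeit modules; the single-argument print call works on both Python 2 and 3.)

```python
import pickle
import timeit

mydict = dict(zip(range(100), range(100)))

for proto in (0, 1, 2):
    blob = pickle.dumps(mydict, proto)
    # timeit.timeit accepts a callable directly, so no setup string is needed.
    dump_t = timeit.timeit(lambda: pickle.dumps(mydict, proto), number=10000)
    load_t = timeit.timeit(lambda: pickle.loads(blob), number=10000)
    print('protocol %d: %5d bytes, dump %.2fs, load %.2fs'
          % (proto, len(blob), dump_t, load_t))
```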
Upvotes: 2