Why does it take longer to dump and load with pickle.HIGHEST_PROTOCOL?

Question

I tested three different pickle protocols: 0, 1, and 2.

In my test I dumped and loaded one dict of about 270000 (int, int) pairs and one set of about 560000 ints.

Here is my test code (you can safely skip the two fetch functions, which I used to fetch the data from a database):

import pickle
import time

protocol = 0 # Tested 0, 1, and 2
print 'Protocol:', protocol
t0 = time.time()
sku2spu_dict = fetch_sku2spu_dict()
pid_set = fetch_valid_pids()
t1 = time.time()
print 'Time in sql:', t1 - t0
# Pickle files written with protocol >= 1 must be opened in binary mode.
with open('sku.pcike_dict', 'wb') as f:
    pickle.dump(sku2spu_dict, f, protocol)
with open('pid.picke_set', 'wb') as f:
    pickle.dump(pid_set, f, protocol)
t2 = time.time()
print 'Time in dump:', t2 - t1
with open('sku.pcike_dict', 'rb') as f:
    sku2spu_dict = pickle.load(f)
with open('pid.picke_set', 'rb') as f:
    pid_set = pickle.load(f)
t3 = time.time()
print 'Time in load:', t3 - t2

And here is the time spent (in seconds) with each protocol:

Protocol: 0
Time in dump: 31.3491470814
Time in load: 29.8991980553

Protocol: 1
Time in dump: 32.3191611767
Time in load: 20.6666529179

Protocol: 2
Time in dump: 94.2163629532
Time in load: 42.7647490501

To my great surprise, protocol 2 is MUCH worse than 0 and 1.

However, the dumped file is smallest with protocol 2: about half the size of the files produced by protocols 0 and 1.
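The size difference can be checked separately from the timings. The following is only a minimal sketch: `pickled_sizes` and `sample` are illustrative names, and `sample` is a small stand-in for the real data, which is not shown in the question.

```python
import pickle


def pickled_sizes(obj, protocols=(0, 1, 2)):
    """Return {protocol: number of bytes produced by pickle.dumps(obj, protocol)}."""
    return {p: len(pickle.dumps(obj, p)) for p in protocols}


# Hypothetical stand-in for the real data: 1000 (int, int) pairs.
sample = dict(zip(range(100000, 101000), range(200000, 201000)))

sizes = pickled_sizes(sample)
for protocol in sorted(sizes):
    print("protocol %d: %d bytes" % (protocol, sizes[protocol]))
```

Running this on the actual dict and set would show how much of the size difference comes from the protocol alone.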

In the documentation, it says:

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.

And for the definition of new-style classes, it says:

Any class which inherits from object. This includes all built-in types like list and dict

So I expected protocol 2 to be faster in dumping and loading objects.

Does anyone know why?

UPDATE:

Problem solved after replacing pickle with cPickle.

Now load and dump take 5 and 3 seconds respectively with protocol 2, whereas protocols 0 and 1 take more than 10 seconds.
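For reference, a minimal sketch of the kind of change described in the update, assuming the only differences are the import and binary file modes. `timed_roundtrip` and the small `pid_set` stand-in are hypothetical names for illustration. The `try`/`except` import also lets the snippet run on Python 3, where `cPickle` no longer exists and `pickle` uses its C accelerator when available.

```python
import os
import tempfile
import time

try:
    import cPickle as pickle  # Python 2: C implementation
except ImportError:
    import pickle  # Python 3: uses the C accelerator automatically if present


def timed_roundtrip(obj, protocol):
    """Dump obj to a temporary file and load it back.

    Returns (dump_seconds, load_seconds, loaded_object).
    """
    fd, path = tempfile.mkstemp(suffix=".pickle")
    os.close(fd)
    try:
        t0 = time.time()
        with open(path, "wb") as f:  # binary mode for protocols >= 1
            pickle.dump(obj, f, protocol)
        t1 = time.time()
        with open(path, "rb") as f:
            loaded = pickle.load(f)
        t2 = time.time()
    finally:
        os.remove(path)
    return t1 - t0, t2 - t1, loaded


# Hypothetical stand-in for the real set of ids.
pid_set = set(range(50000))
for protocol in (0, 1, 2):
    dump_s, load_s, loaded = timed_roundtrip(pid_set, protocol)
    assert loaded == pid_set
    print("protocol %d: dump %.3fs, load %.3fs" % (protocol, dump_s, load_s))
```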

Bakuriu · Accepted Answer

When the documentation talks about "new-style classes" it (probably) refers to user-defined new-style classes. If you run a simple benchmark with them, you can see that protocol 2 is about twice as fast as protocol 0 at dumping them:

>>> import cPickle
>>> import timeit
>>> class MyObject(object):
...     def __init__(self, val):
...             self.val = val
...     def method(self):
...             print self.val
... 
>>> timeit.timeit('cPickle.dumps(MyObject(100), 0)', 'from __main__ import cPickle, MyObject')
17.654622077941895
>>> timeit.timeit('cPickle.dumps(MyObject(100), 1)', 'from __main__ import cPickle, MyObject')
14.536609172821045
>>> timeit.timeit('cPickle.dumps(MyObject(100), 2)', 'from __main__ import cPickle, MyObject')
8.885567903518677

Loading also shows roughly a 2x speedup:

>>> dumped = cPickle.dumps(MyObject(100), 0)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
4.6161839962005615
>>> dumped = cPickle.dumps(MyObject(100), 1)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
4.351701021194458
>>> dumped = cPickle.dumps(MyObject(100), 2)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
2.3936450481414795

In your particular case it might be the opposite, but we can't say anything without the code that defines fetch_sku2spu_dict etc. The only thing I can assume is that the returned value is a dict, and in that case dumping with protocol 2 is about 6 times faster than with protocol 0 when using cPickle:

>>> mydict = dict(zip(range(100), range(100)))
>>> timeit.timeit('cPickle.dumps(mydict, 0)', 'from __main__ import cPickle, mydict')
46.335021018981934
>>> timeit.timeit('cPickle.dumps(mydict, 1)', 'from __main__ import cPickle, mydict')
7.913743019104004
>>> timeit.timeit('cPickle.dumps(mydict, 2)', 'from __main__ import cPickle, mydict')
7.798863172531128

And it's roughly 2.3 times faster on loading:

>>> dumped = cPickle.dumps(mydict, 0)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
32.81050395965576
>>> dumped = cPickle.dumps(mydict, 1)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
13.997781038284302
>>> dumped = cPickle.dumps(mydict, 2)
>>> timeit.timeit('cPickle.loads(dumped)', 'from __main__ import cPickle, dumped')
14.006750106811523

On the other hand, when using the pure-Python version of the module (pickle) I found that:

>>> import pickle
>>> mydict = dict(zip(range(100), range(100)))
>>> timeit.timeit('pickle.dumps(mydict,0)', 'from __main__ import pickle, mydict', number=10000)
2.9552500247955322
>>> timeit.timeit('pickle.dumps(mydict,1)', 'from __main__ import pickle, mydict', number=10000)
3.831756830215454
>>> timeit.timeit('pickle.dumps(mydict,2)', 'from __main__ import pickle, mydict', number=10000)
3.842888116836548

So it seems that, with the pure-Python version, dumping built-in objects with protocols 1 and 2 is slower than with protocol 0. When loading, however, protocol 0 is again the slowest of the three:

>>> dumped = pickle.dumps(mydict, 0)
>>> timeit.timeit('pickle.loads(dumped)', 'from __main__ import pickle, dumped', number=10000)
2.988792896270752
>>> dumped = pickle.dumps(mydict, 1)
>>> timeit.timeit('pickle.loads(dumped)', 'from __main__ import pickle, dumped', number=10000)
1.2793281078338623
>>> dumped = pickle.dumps(mydict, 2)
>>> timeit.timeit('pickle.loads(dumped)', 'from __main__ import pickle, dumped', number=10000)
1.5425071716308594

As you can see from the mini-benchmarks above, the time taken to pickle depends on a number of factors, from the type of object you are pickling to which version of the pickle module you use. Without further information we won't be able to explain why protocol 2 is so much slower in your case.
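To rerun this kind of comparison on several object types at once, something like the following could be used. It is only a sketch: `bench` and the sample objects are illustrative names, not part of the original code. The `pickler` argument is an assumption added so that a different module (for example `cPickle` on Python 2) can be passed in for comparison.

```python
import pickle
import timeit


def bench(obj, pickler=pickle, protocols=(0, 1, 2), number=200):
    """Time dumps/loads of obj for each protocol.

    Returns {protocol: (dump_seconds, load_seconds)}, each measured
    over `number` repetitions.
    """
    results = {}
    for protocol in protocols:
        data = pickler.dumps(obj, protocol)
        dump_s = timeit.timeit(lambda: pickler.dumps(obj, protocol), number=number)
        load_s = timeit.timeit(lambda: pickler.loads(data), number=number)
        results[protocol] = (dump_s, load_s)
    return results


# Illustrative inputs; replace them with the objects you actually pickle.
samples = {
    "dict of ints": dict(zip(range(1000), range(1000))),
    "set of ints": set(range(1000)),
    "list of strs": [str(i) for i in range(1000)],
}

for name in sorted(samples):
    for protocol, (dump_s, load_s) in sorted(bench(samples[name]).items()):
        print("%-12s protocol %d: dump %.4fs, load %.4fs"
              % (name, protocol, dump_s, load_s))
```

On Python 2, calling `bench(obj, pickler=cPickle)` next to `bench(obj, pickler=pickle)` would show the effect of the implementation separately from the effect of the protocol.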
