damzam

Reputation: 1961

Is the defaultdict in Python's collections module really faster than using setdefault?

I've seen other Python programmers use defaultdict from the collections module for the following use case:

from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

def main():
    d = defaultdict(list)
    for k, v in s:
        d[k].append(v)

I've typically approached this problem by using setdefault instead:

def main():
    d = {}
    for k, v in s:
        d.setdefault(k, []).append(v)

The docs do in fact claim that defaultdict is faster, but my own tests show the opposite:

$ python -mtimeit -s "from withsetdefault import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 4.51 usec per loop
$ python -mtimeit -s "from withdefaultdict import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 5.38 usec per loop

Is there something wrong with how I've set up the tests?

For reference, I'm using Python 2.7.3 [GCC 4.2.1 (Apple Inc. build 5666)].

Upvotes: 21

Views: 10885

Answers (2)

Dhruv

Reputation: 405

Adding to Tim's answer: when you create many fresh defaultdicts, it is adding elements, not construction, that becomes costlier. Initialization does add a little time, but the main difference comes from adding elements to each fresh defaultdict, where every key starts out missing.

Here's some code that shows the observation:

import time
from collections import defaultdict

def time_taken_mult_dd(sz):
    # Create a fresh defaultdict on every iteration, so each d[i2] += 1
    # sees a missing key and has to invoke the int factory.
    t1 = t2 = 0
    for i1 in range(1000000):
        start1 = time.time()
        d = defaultdict(int)
        end1 = time.time()

        start2 = time.time()
        for i2 in range(sz):
            d[i2] += 1
        end2 = time.time()

        t1 += end1 - start1
        t2 += end2 - start2

    # t1 and t2 are totals over 10**6 iterations, so printing them with an
    # e-06 suffix shows the average time per iteration in seconds.
    print(f"Time taken for initialization {t1:5.2f}e-06")
    print(f"Time taken for addition       {t2:5.2f}e-06")

def time_taken_one_dd(sz):
    # Reuse one defaultdict: after the first pass every key already
    # exists, so the factory is never called again.
    t = 0
    d = defaultdict(int)
    for i1 in range(1000000):

        start = time.time()
        for i2 in range(sz):
            d[i2] += 1
        end = time.time()

        t += end - start

    print(f"Time taken for addition       {t:5.2f}e-06")

And here's the output. Observe how the initialization time stays roughly the same while adding elements becomes more time-consuming:

>>> time_taken_one_dd(10)
Time taken for addition        1.59e-06
>>> time_taken_mult_dd(10)
Time taken for initialization  0.41e-06
Time taken for addition        2.64e-06
>>> 
>>> time_taken_one_dd(50)
Time taken for addition        6.46e-06
>>> time_taken_mult_dd(50)
Time taken for initialization  0.59e-06
Time taken for addition       11.80e-06
>>> 
>>> time_taken_one_dd(100)
Time taken for addition       12.61e-06
>>> time_taken_mult_dd(100)
Time taken for initialization  0.71e-06
Time taken for addition       23.05e-06
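
The additions get costlier because defaultdict only invokes its factory for keys that are missing: with many fresh dicts every key starts out missing, so every d[i2] += 1 pays the factory call, while a reused dict stops paying it after the first pass. A minimal sketch to confirm this (the counting_factory helper is mine, purely for illustration):

from collections import defaultdict

calls = 0

def counting_factory():
    # Stand-in for int() that records how often the factory fires.
    global calls
    calls += 1
    return 0

d = defaultdict(counting_factory)
for _ in range(3):          # three passes over the same keys
    for k in range(10):
        d[k] += 1

print(calls)  # 10: the factory only runs on the first (missing) access per key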

Upvotes: 0

Tim Pietzcker

Reputation: 336198

Yes, there is something "wrong":

You have put the creation of the (default)dict into the statement instead of the setup. Constructing a new defaultdict is more expensive than constructing a normal dict, and construction is usually not what you should be profiling in a program: after all, you build your data structures once but you use them many times.
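
There is also a per-iteration difference worth noting: the default argument to setdefault is evaluated on every call, so the setdefault version builds (and usually discards) a fresh empty list each time through the loop, while defaultdict only calls list() for keys that are actually missing. A small sketch to count constructions (the factory helper is mine, for illustration only):

from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
built = {'defaultdict': 0, 'setdefault': 0}

def factory():
    # Counts how many empty lists the defaultdict actually creates.
    built['defaultdict'] += 1
    return []

d1 = defaultdict(factory)
for k, v in s:
    d1[k].append(v)               # factory runs only on missing keys

d2 = {}
for k, v in s:
    built['setdefault'] += 1      # the [] argument is built every iteration
    d2.setdefault(k, []).append(v)

print(built)  # {'defaultdict': 3, 'setdefault': 5}: 3 distinct keys vs 5 calls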

If you structure your tests as below, you'll see that the defaultdict operations are indeed faster:

>>> import timeit
>>> setup1 = """from collections import defaultdict
... s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = defaultdict(list)"""
>>> stmt1 = """for k, v in s:
...     d[k].append(v)"""
>>> setup2 = """s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = {}"""
>>> stmt2 = """for k, v in s:
...     d.setdefault(k, []).append(v)"""
>>> timeit.timeit(setup=setup1, stmt=stmt1)
1.0283400125194078
>>> timeit.timeit(setup=setup2, stmt=stmt2)
1.7767367580925395

Python 2.7.3 on Win7 x64.
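
The same fix works for the command-line form used in the question: move the dict construction into the -s setup so that only the loop body is timed (the commands below are a sketch; absolute timings will vary by machine):

$ python -mtimeit -s "from collections import defaultdict; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]; d = defaultdict(list)" "for k, v in s: d[k].append(v)"
$ python -mtimeit -s "s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]; d = {}" "for k, v in s: d.setdefault(k, []).append(v)"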

Upvotes: 26
