Reputation: 1961
I've seen other Python programmers use defaultdict from the collections module for the following use case:
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
def main():
    d = defaultdict(list)
    for k, v in s:
        d[k].append(v)
I've typically approached this problem by using setdefault instead:
def main():
    d = {}
    for k, v in s:
        d.setdefault(k, []).append(v)
The docs do in fact claim that using defaultdict is faster, but I've seen the opposite when testing it myself:
$ python -mtimeit -s "from withsetdefault import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 4.51 usec per loop
$ python -mtimeit -s "from withdefaultdict import main; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)];" "main()"
100000 loops, best of 3: 5.38 usec per loop
Is there something wrong with how I've set up the tests?
For reference, I'm using Python 2.7.3 [GCC 4.2.1 (Apple Inc. build 5666)].
Upvotes: 21
Views: 10885
Reputation: 405
Adding to Tim's answer: when you create many defaultdicts, it's adding elements that becomes costlier. Initialization does add some time, but the main difference comes from the time taken to add elements to the freshly created defaultdicts.
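The reason is likely the miss path: looking up a key that isn't present invokes default_factory, while a reused dict already contains the keys after the first pass, so later increments skip that call. A minimal sketch of the mechanism (the calls list is just instrumentation added for illustration):
from collections import defaultdict

calls = []
d = defaultdict(lambda: calls.append('factory ran') or 0)
d['a'] += 1   # missing key: default_factory is invoked
d['a'] += 1   # existing key: no factory call
print(calls)  # ['factory ran'] -- only the first access paid for the factory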
Here's some timing code that demonstrates the observation:
import time
from collections import defaultdict

def time_taken_mult_dd(sz):
    # Create a fresh defaultdict on every iteration; every d[i2] += 1
    # is then a missing-key access that invokes the int factory.
    t1 = t2 = 0
    for i1 in range(1000000):
        start1 = time.time()
        d = defaultdict(int)
        end1 = time.time()
        start2 = time.time()
        for i2 in range(sz):
            d[i2] += 1
        end2 = time.time()
        t1 += end1 - start1
        t2 += end2 - start2
    # t1 and t2 are totals over 10**6 iterations, so the printed value
    # is the average time per iteration in microseconds.
    print(f"Time taken for initialization {t1:5.2f}e-06")
    print(f"Time taken for addition {t2:5.2f}e-06")

def time_taken_one_dd(sz):
    # Reuse a single defaultdict; after the first iteration every key
    # already exists, so no factory calls are made.
    t = 0
    d = defaultdict(int)
    for i1 in range(1000000):
        start = time.time()
        for i2 in range(sz):
            d[i2] += 1
        end = time.time()
        t += end - start
    print(f"Time taken for addition {t:5.2f}e-06")
And here's the output. Notice how the initialization time stays small, while adding elements to the fresh defaultdicts takes roughly twice as long as adding to the reused one:
>>> time_taken_one_dd(10)
Time taken for addition 1.59e-06
>>> time_taken_mult_dd(10)
Time taken for initialization 0.41e-06
Time taken for addition 2.64e-06
>>>
>>> time_taken_one_dd(50)
Time taken for addition 6.46e-06
>>> time_taken_mult_dd(50)
Time taken for initialization 0.59e-06
Time taken for addition 11.80e-06
>>>
>>> time_taken_one_dd(100)
Time taken for addition 12.61e-06
>>> time_taken_mult_dd(100)
Time taken for initialization 0.71e-06
Time taken for addition 23.05e-06
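As an aside, time.time() has fairly coarse resolution for measurements this small; the same comparison can be expressed with timeit, which runs the loop and picks the timer for you. A sketch assuming Python 3 (absolute numbers will vary by machine):
import timeit

# Reused dict: after the first run the keys already exist, so no factory calls.
reused = timeit.timeit("for i in range(10): d[i] += 1",
                       setup="from collections import defaultdict; d = defaultdict(int)",
                       number=1000000)

# Fresh dict per run: every d[i] += 1 misses and invokes int().
fresh = timeit.timeit("d = defaultdict(int)\nfor i in range(10): d[i] += 1",
                      setup="from collections import defaultdict",
                      number=1000000)

print(reused, fresh)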
Upvotes: 0
Reputation: 336198
Yes, there is something "wrong": you have put the creation of the (default)dict into the statement instead of the setup. Constructing a new defaultdict is more expensive than constructing a normal dict, and usually that's not the bottleneck you should be profiling in a program - after all, you build your data structures once but you use them many times.
If you do your tests like below, you see that defaultdict operations are indeed faster:
>>> import timeit
>>> setup1 = """from collections import defaultdict
... s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = defaultdict(list)"""
>>> stmt1 = """for k, v in s:
... d[k].append(v)"""
>>> setup2 = """s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
... d = {}"""
>>> stmt2 = """for k, v in s:
... d.setdefault(k, []).append(v)"""
>>> timeit.timeit(setup=setup1, stmt=stmt1)
1.0283400125194078
>>> timeit.timeit(setup=setup2, stmt=stmt2)
1.7767367580925395
Python 2.7.3 on Win7 x64.
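The same fix applies to the command-line runs from the question: move the dict construction into the -s setup string. An untested sketch (note that, as above, d then persists and keeps growing across the timed runs):
$ python -mtimeit -s "from collections import defaultdict; s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]; d = defaultdict(list)" "for k, v in s: d[k].append(v)"
$ python -mtimeit -s "s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]; d = {}" "for k, v in s: d.setdefault(k, []).append(v)"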
Upvotes: 26