Reputation: 408
Basically, I need to make a lookup table with non-consecutive integer IDs. I'm wondering whether, in terms of lookup speed, I'm generally better off using a dict with integer keys anyway, or using a very long list with a lot of empty indices. It seems to me that a list might still be faster, as Python should know exactly where to look, but I'm wondering if there are any backend processes in the dict to compensate, and whether the additional memory required for those empty list slots would negate the (probably) more easily traversed list's speed gains. Are there any alternatives to lists and dicts that might be better suited to this?
I have seen this question, but it doesn't totally answer mine: Dictionary access speed comparison with integer key against string key
ETA: I'm implementing lookup tables like this twice in my program. One instance sees a max id of 5,000 with 70-100 objects populated; the other has a max id of 750 with 20-30 populated.
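To make the two candidates concrete, here is a sketch of both layouts using the numbers above (the ids and values are made up for illustration):

```python
import sys

max_id = 5000                       # highest id that can occur (from the question)
ids = list(range(0, max_id, 50))    # 100 non-consecutive ids, made up for illustration
values = [object() for _ in ids]    # stand-ins for the real objects

# Option 1: a dict keyed by the integer ids
as_dict = dict(zip(ids, values))

# Option 2: a list padded with None so the id itself is the index
as_list = [None] * (max_id + 1)
for i, v in zip(ids, values):
    as_list[i] = v

# Both give direct lookup by id:
assert as_dict[150] is as_list[150]

# The list pays for every empty slot, the dict for its hash-table overhead:
print(sys.getsizeof(as_list), sys.getsizeof(as_dict))
```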
Upvotes: 12
Views: 5095
Reputation: 101919
To answer your question about dict vs list you'd have to give exact information about the number of elements, the number of missing elements, etc., so that we could estimate the memory usage of the two data structures and try to predict and/or check their performance.

In general a dict of N key-value pairs requires quite a bit more memory than a list with N values:

- a dict must keep track of the keys
- a dict is never more than 2/3 full. When this happens the memory allocated is doubled (this is required to have O(1) amortized time operations on the dict).

However there is an alternative to these data structures which should provide very good performance: blist. The blist package provides an interface that matches that of list, only it is implemented using B-trees. It can efficiently handle sparse lists. Most operations take either O(1) or O(log n) time, so they are quite efficient.
For example, you could first create a sparse blist like this:
from blist import blist
seq = blist([None])
seq *= 2**30 # create a 2**30 element blist. Instantaneous!
And then you can set only the indices that have a value:
for i, value in zip(indices, values):
seq[i] = value
The full documentation is here.
Note that blist provides other efficient operations such as:

- copying a blist takes O(log n) time
- taking a [i:j] slice takes O(log n) time
- inserting an element is an O(log n) operation
- deleting an element is an O(log n) operation

Since you gave some numbers, here's how they compare to dicts:
>>> from blist import blist
>>> b = blist([None])
>>> b *= 5000
>>> for i in range(100):b[i] = i
...
>>> b.__sizeof__()
2660
>>> d = dict()
>>> for i in range(100):d[i] = i
...
>>> d.__sizeof__()
6216
>>> b = blist([None])
>>> b *= 750
>>> for i in range(30):b[i] = i
...
>>> b.__sizeof__()
1580
>>> d = dict()
>>> for i in range(30):d[i] = i
...
>>> d.__sizeof__()
1608
In both cases a blist takes less memory (in the first case it takes 1/3 of the memory of the equivalent dict). Note that the memory taken by a blist also depends on whether or not the indices are contiguous (contiguous is better). However, even using random indices:
>>> b = blist([None])
>>> b *= 5000
>>> import random
>>> for i in range(100):b[random.randint(0, 4999)] = i
...
>>> b.__sizeof__()
2916
It's still much better than the dict.
Even lookup times are better:
In [1]: from blist import blist
...: import random
...:
In [2]: b = blist([None])
In [3]: b *= 5000
In [4]: for i in range(100):b[random.randint(0, 4999)] = i
In [5]: %timeit b[0]
10000000 loops, best of 3: 50.7 ns per loop
In [6]: d = dict()
In [7]: for i in range(100):d[random.randint(0, 4999)] = i
In [10]: %timeit d[1024] # 1024 is an existing key in this dictionary
10000000 loops, best of 3: 70.7 ns per loop
In [11]: %timeit b[1024]
10000000 loops, best of 3: 50.7 ns per loop
Note that a plain list takes about 47 ns to look up an index on my machine, so blist is really close to a list in terms of lookup performance on small lists like the ones you have.
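One caveat: the blist C extension is quite old and may not build on recent Python versions. If it is unavailable, a plain dict with .get gives a similar sparse-lookup interface; a minimal fallback sketch (the ids and values below are made up):

```python
# Fallback sketch, assuming blist is unavailable: a plain dict gives the
# same sparse id -> value lookup, with .get returning None for empty
# slots, mirroring the None-filled blist above.
table = {}
for i, value in zip([3, 400, 4999], ["a", "b", "c"]):  # made-up sparse ids
    table[i] = value

print(table.get(400))   # "b"
print(table.get(1000))  # None, like an unset blist index
```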
Upvotes: 17
Reputation: 148880
I think there is no general answer to that question. It depends on the distribution of the integer ids, the available memory and your performance requirements.

A dict is essentially an array whose index is the result of a hash function applied to the key. Python's hash algorithm is well optimized, but it is a general-purpose one. If you know that your ids have a special distribution, you could try to find a hash function specially adapted to it. You can find pointers to go further in the Wikipedia article on hash functions, or in the good old standard C library hash.
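As a concrete illustration (the key distribution below is hypothetical): in CPython, small non-negative integers hash to themselves, so an integer-keyed dict is already close to direct indexing, and a distribution-specific "hash" only pays off when it lets you compact the keys into a dense table:

```python
# In CPython, hash(n) == n for small non-negative ints, so a dict keyed
# by such ids already places them about as directly as a hash can.
for n in (0, 1, 42, 4999):
    assert hash(n) == n

# A custom "hash" helps when you can exploit structure in the keys:
# e.g. ids that are all multiples of 50 compact to a dense range.
ids = range(0, 5000, 50)        # hypothetical key distribution
compact = [None] * len(ids)     # dense table, no empty waste

def slot(i):
    return i // 50              # perfect hash for this distribution

for k in ids:
    compact[slot(k)] = k

assert compact[slot(150)] == 150
```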
Upvotes: 1
Reputation:
Lists:

1. append and pop from the end of a list are fast
2. insert and pop from the beginning of a list are slow (there is a heavy shift operation behind these two functions)
3. it is better to use collections.deque for the 2nd case.
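Points 2 and 3 can be sketched with a minimal standard-library comparison:

```python
from collections import deque

# Removing from the front of a list shifts every remaining element (O(n)),
# while a deque pops from either end in O(1).
lst = list(range(5))
first = lst.pop(0)          # O(n): shifts the 4 remaining items

dq = deque(range(5))
front = dq.popleft()        # O(1)
dq.appendleft(-1)           # O(1) as well

print(first, front, list(dq))  # 0 0 [-1, 1, 2, 3, 4]
```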
Dictionaries:

4. Access operations are faster compared to lists

Looping through dictionaries and lists:

- use the iteritems() method (items() in Python 3) to retrieve the key and its corresponding value at the same time.

Upvotes: 3