Rus925

Reputation: 408

Better to use dict of integer keys or a very long list?

Basically, I need to make a lookup table with non-consecutive integer IDs. In terms of lookup speed, am I generally better off using a dict with integer keys anyway, or a very long list with a lot of empty indices? It seems to me that a list should still be faster, since Python knows exactly where to look, but I'm wondering whether the dict has behind-the-scenes machinery that compensates, and whether the extra memory needed for all those empty list slots would negate the list's (probable) speed gain. Are there any alternatives to lists and dicts that might be better suited to this?

I have seen this question, but it doesn't totally answer mine: Dictionary access speed comparison with integer key against string key

ETA: I'm implementing lookup tables like this twice in my program. One instance sees a max id of 5,000 with 70-100 objects populated; the other has a max id of 750 with 20-30 populated.
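For concreteness, here is a minimal sketch of the two options I'm weighing (the ids and values below are made up, just to show the shape of each structure):

objects = {i * 50: "obj%d" % i for i in range(100)}   # ~100 items, non-consecutive ids up to ~5000

# Option 1: dict keyed directly by integer id
by_id_dict = dict(objects)

# Option 2: list padded with None up to the largest id
by_id_list = [None] * (max(objects) + 1)
for i, value in objects.items():
    by_id_list[i] = value

# lookup looks the same either way
assert by_id_dict[250] == by_id_list[250]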

Upvotes: 12

Views: 5095

Answers (3)

Bakuriu

Reputation: 101919

To answer your question about dict vs list you'd have to give exact information about the number of elements, the number of missing elements, etc., so that we could estimate the memory usage of the two data structures precisely and try to predict and/or check their performance.

In general a dict of N key-value pairs requires quite a bit more memory than a list with N values:

  • The dict must keep track of the keys
  • The dict is never more than 2/3 full; when it reaches that threshold the allocated memory is doubled (this is required to keep operations on the dict amortized O(1)). A quick illustration of this growth follows.
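As a rough, hedged illustration of that growth pattern (exact sizes depend on the Python version and platform), you can watch the reported size of a dict jump as it is filled:

import sys

d = {}
last = sys.getsizeof(d)
for i in range(100):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:                          # a resize just happened
        print(len(d), "entries ->", size, "bytes")
        last = size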

However there is an alternative to these data structures which should provide very good performance: blist. The blist package provides an interface that matches that of list, except that it is implemented using B-trees. It can efficiently handle sparse lists. Most operations take either O(1) or O(log n) time, so they are quite efficient.

For example you could first create a sparse blist doing:

from blist import blist

seq = blist([None])
seq *= 2**30    # create a 2**30 element blist. Instantaneous!

And then you can set only the indices that have a value:

for i, value in zip(indices, values):
    seq[i] = value
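Lookups then work exactly like ordinary list indexing (still using the hypothetical indices/values names from above); an index that was never assigned simply returns the None placeholder:

value = seq[indices[0]]    # a value you stored above
other = seq[12345]         # None, assuming index 12345 was never assigned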

The full documentation is here.

Note that blist provides other efficient operations such as:

  • Concatenating two blists takes O(log n) time
  • Taking an [i:j] slice takes O(log n) time
  • Inserting an element at a given index takes O(log n) operations
  • Popping an element (from any position) takes O(log n) operations (a short sketch follows this list)
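A rough sketch of those operations on the seq blist built above (all of them stay logarithmic instead of linear, since blist is a drop-in replacement for list):

combined = seq + seq          # concatenation: O(log n)
window = seq[100:200]         # slicing: O(log n), returns another blist
seq.insert(0, "sentinel")     # insert at an arbitrary index: O(log n)
first = seq.pop(0)            # pop from an arbitrary position: O(log n)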

Since you gave some numbers, here's how they compare to dicts:

>>> from blist import blist
>>> b = blist([None])
>>> b *= 5000
>>> for i in range(100):b[i] = i
... 
>>> b.__sizeof__()
2660
>>> d = dict()
>>> for i in range(100):d[i] = i
... 
>>> d.__sizeof__()
6216
>>> b = blist([None])
>>> b *= 750
>>> for i in range(30):b[i] = i
... 
>>> b.__sizeof__()
1580
>>> d = dict()
>>> for i in range(30):d[i] = i
... 
>>> d.__sizeof__()
1608

In both cases a blist takes less memory (in the first case well under half the memory of the equivalent dict). Note that the memory taken by a blist also depends on whether or not the indices are contiguous (contiguous is better). However, even using random indices:

>>> b = blist([None])
>>> b *= 5000
>>> import random
>>> for i in range(100):b[random.randint(0, 4999)] = i
... 
>>> b.__sizeof__()
2916

It's still much better than the dict.

Even lookup times are better:

In [1]: from blist import blist
   ...: import random
   ...: 

In [2]: b = blist([None])

In [3]: b *= 5000

In [4]: for i in range(100):b[random.randint(0, 4999)] = i

In [5]: %timeit b[0]
10000000 loops, best of 3: 50.7 ns per loop

In [6]: d = dict()

In [7]: for i in range(100):d[random.randint(0, 4999)] = i

In [10]: %timeit d[1024]   # 1024 is an existing key in this dictionary
10000000 loops, best of 3: 70.7 ns per loop

In [11]: %timeit b[1024]
10000000 loops, best of 3: 50.7 ns per loop

Note that a list takes about 47 ns to look up an index on my machine, so for small lists like yours blist is really close to a plain list in terms of lookup performance.
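If you want to reproduce the comparison without IPython, a rough equivalent with the standard timeit module looks like this (absolute numbers will of course differ from machine to machine):

import timeit

setup = """
from blist import blist
import random
random.seed(0)
b = blist([None]); b *= 5000
d = {}
for i in range(100):
    j = random.randint(0, 4999)
    b[j] = i
    d[j] = i
key = next(iter(d))
"""
print(timeit.timeit("b[key]", setup=setup, number=10**6))   # blist lookup
print(timeit.timeit("d[key]", setup=setup, number=10**6))   # dict lookup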

Upvotes: 17

Serge Ballesta

Reputation: 148880

I think there is no general answer to that question. It depends on the distribution of the integer ids, the available memory and your performance requirements. The rules are:

  • a list lookup is faster, because you do not have to compute the hash of the key
  • a dict may be more compact if the largest key is large
  • if your largest key is very large (around 2^30) you will waste a lot of memory and the system will start swapping, which greatly degrades performance

Here is what could be a rule of thumb:

  • if there are "few" empty values, or if you know the largest key will be "reasonably" low (relative to the memory you are willing to spend on it) => use a list
  • if the preceding condition does not hold and you do not have strong performance requirements => use a dict
  • if neither of the two preceding assumptions holds, you will have to try some hash function optimizations - I detail this below

The theory behind a dict is an array whose index is the result of a hash function applied to the key. Python's algorithm is well optimized, but it is a general-purpose one. If you know that your keys have a special distribution, you could try to find a hash function specially adapted to that distribution. You can find pointers to go further in the Wikipedia article on hash functions or in the good old standard C library hash.
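As a toy illustration of that idea (not production code), here is a fixed-size table indexed by a hand-picked hash with linear probing for collisions; the key % size hash is only a stand-in for whatever function actually matches your id distribution:

class SmallTable(object):
    def __init__(self, size=1024):
        self.size = size
        self.slots = [None] * size            # each slot is None or a (key, value) pair

    def _index(self, key):
        return key % self.size                # the "specially adapted" hash function

    def __setitem__(self, key, value):
        i = self._index(key)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size           # linear probing on collision
        self.slots[i] = (key, value)

    def __getitem__(self, key):
        i = self._index(key)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.size
        raise KeyError(key)

table = SmallTable()
table[4999] = "last"
table[0] = "first"
print(table[4999], table[0])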

Upvotes: 1

user3522371

Reputation:

Lists:
1. append and pop at the end of a list are fast
2. insert and pop at the beginning of a list are slow (there is a heavy shift operation behind these two functions)
3. it is better to use collections.deque for the second case (see the sketch right after this list).
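A quick sketch of points 2 and 3 (same result, very different cost at the left end):

from collections import deque

d = deque([1, 2, 3])
d.appendleft(0)      # O(1)
d.popleft()          # O(1)

lst = [1, 2, 3]
lst.insert(0, 0)     # O(n): every element is shifted
lst.pop(0)           # O(n): every element is shifted back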

Dictionaries:
4. Access operations are faster compared to lists



Looping through dictionaries and lists:

  1. Dictionaries use the iteritems() method to retrieve the key and its corresponding value at the same time.
  2. Lists use enumerate() for the same purpose (both loops are sketched after this list).

    Notes:
  3. If your question is just about looping speed, there is no tangible difference between iteritems() and enumerate().
  4. iteritems() was removed in Python 3.x; use items() instead.
  5. zip() is a heavy operation, best avoided here.
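A minimal sketch of both loops (using items(), the Python 3 spelling of iteritems()):

d = {10: "a", 20: "b"}
lst = ["a", "b"]

for key, value in d.items():         # key and value together
    print(key, value)

for index, value in enumerate(lst):  # index and value together
    print(index, value)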

Upvotes: 3
