user213060

Reputation: 1249

What is a hashtable/dictionary implementation for Python that doesn't store the keys?

I'm storing millions, possibly billions, of 4-byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and kept entirely in RAM. Entries would still be looked up by key, unlike with a set().

What is an implementation of this for Python? Is there a name for this?

Yes, collisions are allowed and can be ignored.

(I can make an exception for collisions: the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)

Upvotes: 6

Views: 2438

Answers (10)

user227667


If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key)//32] |= 2**(hash(key)%32) (with d = collections.defaultdict(int))
Check: d[hash(key)//32] & 2**(hash(key)%32)

If you have billions of entries, use a numpy array of size (2**32)/32 instead. (After all, there are only 2**32 possible 32-bit hash values to cover.)
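A sketch of that bitset (function names invented for illustration; note it records membership only, one bit per possible 32-bit hash, rather than mapping to a 4-byte value):

import numpy as np

# One bit per possible 32-bit hash: 2**32 bits / 32 bits per uint32
# = 2**27 entries, i.e. 512 MB of RAM.
bits = np.zeros(2**32 // 32, dtype=np.uint32)

def add(key):
    h = hash(key) & 0xFFFFFFFF           # fold Python's hash to 32 bits
    bits[h // 32] |= np.uint32(1 << (h % 32))

def contains(key):
    h = hash(key) & 0xFFFFFFFF
    return bool(bits[h // 32] & np.uint32(1 << (h % 32)))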

Upvotes: 1

steveha

Reputation: 76745

Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.

If the keys are strings over a small alphabet (for example, strings of digits, like phone numbers), you could use a trie data structure:

http://en.wikipedia.org/wiki/Trie
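For illustration, a minimal (unoptimized) trie might look like this; all names are invented for the sketch, and a memory-conscious version would pack the children into arrays indexed by the small alphabet:

class TrieNode:
    __slots__ = ('children', 'value')
    def __init__(self):
        self.children = {}   # one entry per character of the alphabet
        self.value = None

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value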

Upvotes: 1

jfs

Reputation: 414715

Bloomier filters - space-efficient associative array

From Wikipedia:

Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
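A true Bloomier filter is fairly involved to implement. As a toy sketch of the same interface only (names invented here, and unlike a real Bloomier filter the no-wrong-value guarantee below is merely probabilistic), one can key a dict by a short fingerprint of the key instead of the key itself:

import hashlib

table = {}

def _fingerprint(key):
    # 8-byte stable fingerprint of the key; the key itself is never stored
    return hashlib.blake2b(key.encode(), digest_size=8).digest()

def put(key, value):
    table[_fingerprint(key)] = value

def get(key):
    # may return a foreign value if two keys share a fingerprint
    return table.get(_fingerprint(key))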

Upvotes: 5

Jochen Ritzel

Reputation: 107696

It's the good old space vs. runtime tradeoff: you can have constant-time lookups with linear space usage for the keys in a hashtable. Or you can store the keys implicitly and use O(log n) time with a binary tree, where the (binary) hash of a key gives the path in the tree where its value is stored.
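A sketch of that idea (illustrative, not the answerer's code): a binary trie where successive bits of the key's hash choose the branch, so only 4-byte hashes and values are stored:

class Node:
    __slots__ = ('children', 'hash', 'value')
    def __init__(self):
        self.children = [None, None]
        self.hash = None     # 32-bit hash stored at this node
        self.value = None

root = Node()

def put(key, value):
    h = hash(key) & 0xFFFFFFFF
    node, depth = root, 0
    while node.hash is not None and node.hash != h:
        bit = (h >> depth) & 1
        if node.children[bit] is None:
            node.children[bit] = Node()
        node, depth = node.children[bit], depth + 1
    node.hash, node.value = h, value   # full-hash collision: overwrite

def get(key):
    h = hash(key) & 0xFFFFFFFF
    node, depth = root, 0
    while node is not None:
        if node.hash == h:
            return node.value
        node, depth = node.children[(h >> depth) & 1], depth + 1
    return None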

Upvotes: 3

Victor Olex

Reputation: 1498

It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
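As an illustrative stand-in (using the standard-library dbm module rather than Tokyo Cabinet/BerkeleyDB bindings, but the idea is the same on-disk key/value store), one could even key the store by packed hashes so the keys themselves are never stored:

import dbm
import struct

def hkey(key):
    # store a packed 32-bit hash of the key, not the key itself
    return struct.pack('<I', hash(key) & 0xFFFFFFFF)

with dbm.open('values.db', 'c') as db:
    db[hkey('some-key')] = struct.pack('<I', 12345)      # 4-byte value
    value, = struct.unpack('<I', db[hkey('some-key')])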

Upvotes: 1

Mark Byers

Reputation: 838944

How about using an ordinary dictionary and instead of doing:

d[x]=y

use:

d[hash(x)]=y

To look up:

d[hash(foo)]

Of course, if there is a hash collision, you may get the wrong value back.
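The whole idea as a runnable sketch (function names invented; collisions silently overwrite, which the question allows):

d = {}

def put(key, value):
    d[hash(key)] = value        # only the integer hash is kept

def get(key):
    return d.get(hash(key))     # wrong value possible on a collision

put('foo', 42)
print(get('foo'))               # -> 42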

Upvotes: 3

rob

Reputation: 37654

Although Python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures optimized for the way you will actually use them (sequential access? completely random? etc.).

To create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).

Upvotes: 2

martinr

Reputation: 3812

Build your own b-tree in RAM.

Memory use:

(4 bytes) comparison hash value

(4 bytes) index of the left child, followed if hash <= comparison; if negative, the index of a value

(4 bytes) index of the right child, followed if hash > comparison; if negative, the index of a value

12 bytes per b-tree node. More overhead for the values (see below).

How you structure this in Python: aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead? (That's the standard-library array module.) Anyway, use those. A sketch follows below.

Keep a separate ordered array of subarrays, each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.

This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31-1 entries or a larger hash.

BUT, perhaps a spanner in the works: note that if you are not storing the keys, you will not be able to verify that a looked-up hash value corresponds only to your key, unless some algorithmic or organisational mechanism guarantees that no two keys have the same hash. Quite a serious issue here. Have you considered it? :)
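A minimal sketch of the flat-array layout using the array module (all names invented; simplified to four 4-byte fields per node, unbalanced, and full-hash collisions overwrite):

from array import array

n_hash = array('I')    # comparison hash per node
n_val = array('I')     # the 4-byte value stored at the node
n_left = array('i')    # child index for hash < comparison, -1 = none
n_right = array('i')   # child index for hash > comparison, -1 = none

def _new(h, value):
    n_hash.append(h); n_val.append(value)
    n_left.append(-1); n_right.append(-1)
    return len(n_hash) - 1

def insert(h, value):
    if not n_hash:
        _new(h, value)
        return
    i = 0
    while True:
        if h == n_hash[i]:
            n_val[i] = value               # collision: overwrite
            return
        side = n_left if h < n_hash[i] else n_right
        if side[i] == -1:
            side[i] = _new(h, value)
            return
        i = side[i]

def lookup(h):
    i = 0 if n_hash else -1
    while i != -1:
        if h == n_hash[i]:
            return n_val[i]
        i = (n_left if h < n_hash[i] else n_right)[i]
    return None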

Upvotes: 2

A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.

If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.

Upvotes: 1

jbochi

Reputation: 29654

Why not a dictionary + hashlib?

>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj.encode()).hexdigest()
...
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}

Upvotes: 0
