Reputation: 10176
I have a dictionary master which contains around 50,000 to 100,000 unique lists, which can be simple lists or lists of lists. Every list is assigned to a specific ID (which is the key of the dictionary):
master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]], ...} # len(master) is several tens of thousands
Now I have a few hundred dictionaries, each of which contains around 10,000 lists (same as above: they can be nested). Example of one of those dicts:
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7], ...}
I want to cross-reference this data for every single dictionary against my master, i.e. instead of saving every list within a, I want to store only the ID from master if the list is present in master.
=> a = {'key1': [6, 9, 3, 1], 'key2': 21, 'key3': [7], ...}
I can do that by looping over all values in a and all values of master and trying to match the lists (by sorting them), but that'll take ages.
Now I'm wondering how you would solve this.
I thought of "hashing" every list in master to a unique string and storing it as a key of a new master_inverse reference dict, e.g.:
master_inverse = {hash([1,2,4]): 12, hash([[1, 2, 3], [5, 6, 7, 9]]): 21}
Then it would be very simple to look it up later on:
for k, v in a.items():
    h = hash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]
Do you have a better idea? What could such a hash look like? Is there already a built-in method that is fast and unique?
EDIT: Dunno why I didn't come up with this approach instantly: What do you think of using an MD5 hash of either the pickle or the repr() of each list?
Something like this:
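import hashlib

def myHash(value):
    # repr() gives a deterministic string for (nested) lists of ints;
    # md5() needs bytes in Python 3, hence the encode()
    return hashlib.md5(repr(value).encode("utf-8")).hexdigest()

# Map each list's digest back to its ID in master
master_inverse = {myHash(v): k for k, v in master.items()}

for k, v in a.items():
    h = myHash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]  # replace the list with the master ID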
EDIT2:
I benchmarked it: checking one of the hundred dicts (a in my example, which contains around 20k values for my benchmark) against my master_inverse is very fast, which I didn't expect: 0.08 sec. So I guess I can live with that well enough.
Upvotes: 3
Views: 2788
Reputation: 3105
The MD5 approach will work, but you need to be cautious about the very small possibility of hash collisions when using an MD5 hash (see "How many random elements before MD5 produces collisions?" for more details).
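One way to keep the MD5 lookup and still rule out a false match is to compare the actual lists once a digest matches. The following is only a minimal sketch under assumptions: the helper name md5_of and the tiny sample data are made up for illustration, and it guards against a list in a colliding with a list in master, not against two entries of master colliding with each other.

```python
import hashlib

def md5_of(value):
    # Hypothetical helper: MD5 digest of the repr() of a (nested) list
    return hashlib.md5(repr(value).encode("utf-8")).hexdigest()

# Small sample data in the shape described in the question
master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]]}
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7]}

master_inverse = {md5_of(v): k for k, v in master.items()}

for k, v in a.items():
    master_id = master_inverse.get(md5_of(v))
    # Replace only if the digest matched AND the lists are really equal
    if master_id is not None and master[master_id] == v:
        a[k] = master_id
```

The extra equality check only runs for values whose digest already matched, so it should add little to the overall lookup time.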
If you need to be absolutely sure that the program works correctly, you can convert the lists to tuples and create a dictionary whose keys are those tuples and whose values are the keys from your master dictionary (same as master_inverse, but with the full values instead of MD5 hash values).
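A rough sketch of the tuple idea could look like the following. The helper to_hashable is a name introduced here for illustration, and it assumes the lists contain only hashable items (such as ints) or further lists. Matching is order-sensitive, so if order should not matter the lists would have to be sorted before conversion.

```python
def to_hashable(value):
    # Hypothetical helper: recursively turn (nested) lists into tuples,
    # because tuples of hashable items can be used as dictionary keys
    if isinstance(value, list):
        return tuple(to_hashable(item) for item in value)
    return value

# Small sample data in the shape described in the question
master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]]}
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7]}

# Keys are the full values (as tuples), values are the IDs from master
master_inverse = {to_hashable(v): k for k, v in master.items()}

for k, v in a.items():
    key = to_hashable(v)
    if key in master_inverse:
        a[k] = master_inverse[key]
```

Because the dictionary compares the full tuples on lookup, a hash collision cannot produce a wrong match here.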
More info on how to use tuples as dictionary keys: http://www.developer.com/lang/other/article.php/630941/Learn-to-Program-using-Python-Using-Tuples-as-Keys.htm.
Upvotes: 2