Reputation: 6197
I have about 20 million key-value pairs . I need to create two dictionaries.
First dictionary: The values are ints, from 0 to 20 million. The keys are strings of length 40 characters, for example '36ae99662ec931a3c20cffdecb39b69a8f7f23fd'.
Second dictionary: Reverse of the first dictionary. The keys are ints, from 0 to 20 million. The values are strings of length 40 characters, for example '36ae99662ec931a3c20cffdecb39b69a8f7f23fd'.
I think for the second dictionary, there are more options, since the index can just be used as the key. For the second option, it looks like sqlite3 is promising.
Lookup speed is not too important, 1 second look up should be okay. The main concern is I don't have too much space to store the dictionary.
As for my best guess for the first type of dictionary, From this SO post
*large* python dictionary with persistence storage for quick look-ups
It looks like dbm would be a decent solution for the first type of dictionary since all the keys and values are stored as bytes, though the answer was given 7 years ago in 2012. I am not sure if it is a decent solution today.
Upvotes: 1
Views: 624
Reputation: 43495
The strings look hexadecimal. In that case you could start by using binascii.unhexlify
to convert them to binary strings. That is a 50% space saving right there.
In [2]: import binascii
In [3]: binascii.unhexlify('36ae99662ec931a3c20cffdecb39b69a8f7f23fd')
Out[3]: b'6\xae\x99f.\xc91\xa3\xc2\x0c\xff\xde\xcb9\xb6\x9a\x8f\x7f#\xfd'
In [4]: len(binascii.unhexlify('36ae99662ec931a3c20cffdecb39b69a8f7f23fd'))
Out[4]: 20
20 million key/value pairs isn't all that much for a modern computer. Looking at the size of the pure data (20 bytes for the string, 4 bytes for the integer) we're talking around half a GB.
In [5]: 20e6 * (20 + 4) / 1e9
Out[5]: 0.48
The most space efficient way is to just make an array of key/value pairs, sorted by the key. Since we know that every pair is 24 bytes, accessing them in a mmapped file is trivial; you can just use slicing. And I would use a binary search for look-up.
This has no storage overhead. But inserting a value would be inefficient.
Upvotes: 1
Reputation: 770
Considering your second dictionary is just a reverse of the first, I think you probably want to use a single table database. You can have a primary key and then an index on the strings as well for fast lookups. Something like sqlite makes sense.
What size memory are you dealing with? It could still be in memory for python but all depends on how much memory you have.
Upvotes: 1