Abhi

Reputation: 6255

*large* python dictionary with persistent storage for quick look-ups

I have 400 million lines of unique key-value data that I would like to be available for quick look-ups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I am not sure if there is a way to disk-map the dictionary without using a lot of memory, except during dictionary creation.

  1. Pickled dictionary object: not sure if this is an optimal solution for my problem.
  2. NoSQL-type databases: ideally I want something with minimal dependencies on third-party stuff; plus, the keys and values are simply numbers. If you feel this is still the best option, I would like to hear that too. Maybe it will convince me.

Please let me know if anything is not clear.

Thanks! -Abhi

Upvotes: 27

Views: 11993

Answers (7)

VDes

Reputation: 66

If you need to handle 400 million key-value pairs efficiently with persistent storage and quick lookups, here are some options:

  1. persidict (install with pip install persidict):

from persidict import PersiDict

db = PersiDict("data_store.json")  # Data is automatically persisted

# Insert data
db["12345"] = 67890
db["98765"] = 43210

# Fast lookup
print(db["12345"])  # Output: 67890

db.close()

  2. LMDB: if you need super-fast lookups without loading everything into RAM, this is a great choice.
import lmdb

env = lmdb.open("data_store.lmdb", map_size=10**9)  # 1GB max size

with env.begin(write=True) as txn:
    txn.put(b"12345", b"67890")

with env.begin() as txn:
    print(txn.get(b"12345").decode())  # Output: 67890

Upvotes: 0

Setop

Reputation: 2510

I personally use LMDB and its Python binding for a DB with a few million records. It is extremely fast, even for a database larger than the RAM. It's embedded in the process, so no server is needed. Dependencies are managed with pip.

The only downside is that you have to specify the maximum size of the DB. LMDB is going to mmap a file of this size. If it is too small, inserting new data will raise an error; if it is too large, you create a sparse file.
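
A minimal sketch of that with the py-lmdb binding (the path and sizes here are just placeholders): on a 64-bit system you can over-allocate map_size, since the backing file stays sparse, and the cap can still be raised later with set_mapsize if it turns out to be too small.

import lmdb

# Over-allocate map_size: the backing file is sparse, so unwritten space
# costs no real disk on a 64-bit system (50 GB here is a placeholder).
env = lmdb.open("kv_store.lmdb", map_size=50 * 1024**3)

with env.begin(write=True) as txn:
    txn.put(b"12345", b"67890")

with env.begin() as txn:
    print(txn.get(b"12345"))  # b'67890'

# If the cap still proves too small, raise it without rebuilding the DB
# (do this while no transactions are active):
env.set_mapsize(100 * 1024**3)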

Upvotes: 4

Tim Hoffman

Reputation: 12986

No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.

From the docs https://docs.python.org/3/library/dbm.html

import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.

I would try this before any of the more exotic options; using shelve/pickle will pull everything into memory on loading.

Cheers

Tim

Upvotes: 14

mhawke

Reputation: 87134

In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.

Your data size sounds huge, so you must do some testing, but shelve/BDB is probably up to it.

Note: The bsddb module has been deprecated, so shelve may not support BDB hashes in the future.
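
For reference, a minimal sketch of the shelve approach (the file name kv_store is just a placeholder); keys must be strings, and values are pickled/unpickled transparently:

import shelve

# Build the store once; values are pickled automatically.
with shelve.open("kv_store") as db:
    db["12345"] = 67890
    db["98765"] = 43210

# Later, in the lookup script, open read-only for look-ups.
with shelve.open("kv_store", flag="r") as db:
    print(db["12345"])  # 67890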

Upvotes: 13

sberry

Reputation: 132138

Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.

  1. Install redis-server
  2. Start redis server
  3. Install the redis python package (pip install redis)
  4. Profit.

import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)

The above assumes a file of values like:

key1=value1
key2=value2
etc=etc

Modify the insertion script to your needs.


import redis
ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do look ups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)

Why I like Redis:

  1. Configurable persistence options
  2. Blazingly fast
  3. Offers more than just key/value pairs (other data types)
  4. @antirez

Upvotes: 5

steveha

Reputation: 76765

I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.

This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.

What are the performance characteristics of sqlite with very large database files?

Upvotes: 4

Sam Mussmann

Reputation: 5993

If you want to persist a large dictionary, you are basically looking at a database.

Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
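
A minimal sketch of that approach with the standard-library sqlite3 module, assuming integer keys and values and a placeholder kv.db file:

import sqlite3

con = sqlite3.connect("kv.db")
con.execute("CREATE TABLE IF NOT EXISTS kv (key INTEGER PRIMARY KEY, value INTEGER)")

# Bulk insert; executemany in a single transaction keeps loading fast.
con.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                [(12345, 67890), (98765, 43210)])
con.commit()

# Look-up: the PRIMARY KEY index makes this a quick point query.
row = con.execute("SELECT value FROM kv WHERE key = ?", (12345,)).fetchone()
print(row[0])  # 67890

con.close()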

Upvotes: 20
