Reputation: 101
I'm wondering if there is a fast on-disk key-value store with Python bindings that supports millions of read/write calls to separate keys. My problem involves counting word co-occurrences in a very large corpus (Wikipedia) and continually updating the co-occurrence counts. This involves reading and writing ~300 million values 70 times, with 64-bit keys and 64-bit values.
I can also represent my data as an upper-triangular sparse matrix with dimensions ~ 2M x 2M.
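For reference, a key for a word pair (i, j) can be formed by packing both ids into one 64-bit integer, roughly like this (a minimal sketch; the helper names and exact packing are illustrative, not my actual code):
import struct

def cooc_key(i, j):
    # keep only the upper triangle: order the pair so that i <= j
    if i > j:
        i, j = j, i
    # both word ids are < ~2M, so they fit comfortably in 32 bits each
    return struct.pack("<Q", (i << 32) | j)

def cooc_value(count):
    # 64-bit unsigned count
    return struct.pack("<Q", count)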
So far I have tried several options.
Right now the only solution that works well enough is LMDB, but the runtime is ~12 days, which seems unreasonable given that I am not processing that much data. For comparison, saving the sub-matrix (with ~300M values) to disk using .npz is almost instant.
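My LMDB loop looks roughly like this (a minimal sketch using the lmdb package; the map_size and batching are illustrative, not my exact code):
import struct
import lmdb

# maximum map size; the file is sparse, so this does not allocate 1 TB up front
env = lmdb.open("./cooc.lmdb", map_size=1 << 40)

def add_counts(updates):
    # updates: iterable of (key_bytes, increment) pairs
    with env.begin(write=True) as txn:
        for key, inc in updates:
            old = txn.get(key)
            count = struct.unpack("<Q", old)[0] if old else 0
            txn.put(key, struct.pack("<Q", count + inc))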
Any ideas?
Upvotes: 7
Views: 2182
Reputation: 17415
Have a look at Plyvel, which is a Python interface to LevelDB.
I used it successfully several years ago, and both projects appear to still be active. My own use case was storing hundreds of millions of key:value pairs, and I was more focused on read performance, but it looks optimized for writes as well.
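For the counting workload in the question, a minimal sketch of what the Plyvel side could look like (the path and key/value encoding are illustrative):
import struct
import plyvel

db = plyvel.DB("./cooc.ldb", create_if_missing=True)

def bump(key, inc=1):
    # read-modify-write of a single 64-bit counter
    old = db.get(key)
    count = struct.unpack("<Q", old)[0] if old else 0
    db.put(key, struct.pack("<Q", count + inc))

# group many updates into one write batch to cut down on write overhead
with db.write_batch() as wb:
    wb.put(struct.pack("<Q", 42), struct.pack("<Q", 1))

db.close()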
Upvotes: 1
Reputation: 71
You might want to check out my project.
pip install rocksdict
This is a fast on-disk key-value store based on RocksDB; it can take any Python object as a value. I consider it to be reliable and easy to use. Its performance is on par with GDBM, but it is cross-platform, whereas GDBM is only available for Python on Linux.
https://github.com/Congyuwang/RocksDict
Below is a demo:
from rocksdict import Rdict, Options
path = "./test_dict"
# create a Rdict with default options at `path`
db = Rdict(path)
db[1.0] = 1
db[1] = 1.0
db["huge integer"] = 2343546543243564534233536434567543
db["good"] = True
db["bad"] = False
db["bytes"] = b"bytes"
db["this is a list"] = [1, 2, 3]
db["store a dict"] = {0: 1}
import numpy as np
db[b"numpy"] = np.array([1, 2, 3])
import pandas as pd
db["a table"] = pd.DataFrame({"a": [1, 2], "b": [2, 1]})
# close Rdict
db.close()
# reopen Rdict from disk
db = Rdict(path)
assert db[1.0] == 1
assert db[1] == 1.0
assert db["huge integer"] == 2343546543243564534233536434567543
assert db["good"] == True
assert db["bad"] == False
assert db["bytes"] == b"bytes"
assert db["this is a list"] == [1, 2, 3]
assert db["store a dict"] == {0: 1}
assert np.all(db[b"numpy"] == np.array([1, 2, 3]))
assert np.all(db["a table"] == pd.DataFrame({"a": [1, 2], "b": [2, 1]}))
# iterate through all elements
for k, v in db.items():
    print(f"{k} -> {v}")
# batch get:
print(db[["good", "bad", 1.0]])
# [True, False, 1]
# delete the Rdict from disk
del db
Rdict.destroy(path)
Upvotes: 1
Reputation: 189
PySpark is more useful here. The snippet below uses Spark's Java API to key each line by its first word; a rough PySpark equivalent follows.
PairFunction<String, String, String> keyData =
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
            return new Tuple2(x.split(" ")[0], x);
        }
    };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
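A rough PySpark equivalent of the Java snippet (this sketch assumes an existing RDD named lines; names are illustrative):
# key each line by its first word, mirroring mapToPair above
pairs = lines.map(lambda x: (x.split(" ")[0], x))
# counting by key would then be a reduceByKey over (key, 1) pairs
counts = pairs.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)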
Upvotes: -2