pufferfish

Reputation: 17415

LevelDB for 100s of millions of entries

What are the top factors to consider when tuning inserts for a LevelDB store?

I'm inserting 500M+ records in the form:

  1. key="rs1234576543", a very predictable structure: rs<1+ digits>
  2. value="1,20000,A,C", a string that can be much longer but is usually ~40 chars
  3. keys are unique
  4. key insert order is random

into a LevelDB store using the Python plyvel library, and see a dramatic drop in speed as the number of records grows. I guess this is expected, but are there tuning measures I could look at to make it scale better?

Example code:

import plyvel

BATCHSIZE = 1000000

db = plyvel.DB('/tmp/lvldbSNP151/', create_if_missing=True)
wb = db.write_batch()
# items not in any key order
for i, (key, value) in enumerate(DBSNPfile, start=1):
    wb.put(key, value)  # key and value must be bytes
    if i % BATCHSIZE == 0:
        wb.write()
        wb = db.write_batch()  # start a fresh batch for the next chunk
wb.write()

I've tried various batch sizes, which helps a bit, but I'm hoping there's something else I've missed. For example, can knowing the max length of a key (or value) be leveraged?

Upvotes: 4

Views: 2182

Answers (1)

wouter bolsterlee

Reputation: 4037

(Plyvel author here.)

LevelDB keeps all database items in sorted order. Since you are writing in random order, this basically means that all parts of the database get rewritten all the time, because LevelDB has to merge SSTs (this happens in the background). As your database gets larger and you keep adding more items to it, this results in reduced write throughput.

I suspect that performance will not degrade as badly if you have better locality of your writes.
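For example, one possible way to get better locality is to sort each chunk of input by key before writing it. A rough sketch below: sorted_chunks is a hypothetical helper, db and DBSNPfile are the objects from the question, and the assumption is that one chunk of pairs fits in memory.

import itertools

def sorted_chunks(items, chunk_size=1_000_000):
    # Hypothetical helper: yield chunks of (key, value) pairs, each sorted
    # by key, so LevelDB sees longer runs of nearby keys per batch.
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        chunk.sort(key=lambda kv: kv[0])
        yield chunk

for chunk in sorted_chunks(DBSNPfile):
    with db.write_batch() as wb:  # the batch is written when the block exits
        for key, value in chunk:
            wb.put(key, value)

Pre-sorting the whole input up front (e.g. with an external sort) should help even more, since fully sorted writes leave much less merging for the background compactions to do.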

Other ideas that may be worth trying out are:

  • increase the write_buffer_size
  • increase the max_file_size
  • experiment with a larger block_size
  • use .write_batch(sync=False)

The above can all be used from Python by passing extra keyword arguments to plyvel.DB and to the .write_batch() method. See the API docs for details.
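Something like this (a sketch with illustrative values, not tuned recommendations):

import plyvel

db = plyvel.DB(
    '/tmp/lvldbSNP151/',
    create_if_missing=True,
    write_buffer_size=64 * 1024 * 1024,  # bigger memtable, fewer flushes
    max_file_size=32 * 1024 * 1024,      # bigger SST files
    block_size=16 * 1024,                # bigger data blocks
)
wb = db.write_batch(sync=False)          # don't force a sync on every batch write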

Upvotes: 5
