wydadman

Reputation: 36

I use an offset to access elements in an LMDB database, but it is very slow. Why is that?

I have put my entire LMDB database into a single, extremely large key/value bytestring array associated with a single key (the only one in my LMDB database). I access the values I need by an offset; the offset is an index into the array, as you can see in the code snippet. With such a structure my access time should be O(1). The problem is that querying my database is very slow, and I have absolutely no idea why it takes so long. Is it a good idea to store my huge array under a single key in the first place? Is there a particular mechanism in Python that makes accessing an element by its index in an array so slow? Is the data not contiguous? I am struggling to figure out what is wrong, please help!

import lmdb

env = lmdb.open('light')
with env.begin(write=False, buffers=True) as txn:
    cursor = txn.cursor()
    cursor.first()
    for i in range(18000000):       # I have around 180000 elements
        cursor.value()[4*i:4*i+4]   # this loop lasts an eternity

Upvotes: 1

Views: 1203

Answers (1)

zwol

Reputation: 140748

I think the problem is that cursor.value() is expensive. I don't know enough about the guts of LMDB or its Python bindings to know how much work it has to do, but it could be doing a partial B-tree traversal, invoking the OS to set up memory mappings, constructing complicated proxy objects, perhaps even copying the entire array out of LMDB into a Python buffer object. And you're calling it on every iteration of the loop, so it has to repeat that work every time. Destroying the object returned by cursor.value() may also be expensive and you're repeating that work every time too.

If I'm right, you should be able to get a substantial speedup by hoisting the invocation of value() out of the loop:

import lmdb

env = lmdb.open('light')
with env.begin(write=False, buffers=True) as txn:
    cursor = txn.cursor()
    if cursor.first():
        data = cursor.value()           # fetch the buffer once, outside the loop
        for i in range(18000000):
            data[4*i:4*i+4]

Python's interpreter is not very efficient and its bytecode compiler doesn't do very many optimizations, so you will probably see a small but measurable further speedup from using three-argument range to avoid having to multiply by 4 twice on every loop iteration:

import lmdb

env = lmdb.open('light')
with env.begin(write=False, buffers=True) as txn:
    cursor = txn.cursor()
    if cursor.first():
        data = cursor.value()
        for i in range(0, 18000000*4, 4):   # step by 4 to avoid multiplying on every iteration
            data[i:i+4]
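
If you want to measure the difference yourself, something along the lines of the following sketch should work. It assumes the same 'light' database and four-byte records as in the question; the names slow, fast, and N are just for illustration, and you may want to shrink N, since the per-iteration version is the one that takes forever:

import time
import lmdb

N = 18000000  # element count from the question; reduce for a quicker comparison

env = lmdb.open('light', readonly=True)

def slow():
    # calls cursor.value() on every iteration, as in the question
    with env.begin(write=False, buffers=True) as txn:
        cursor = txn.cursor()
        if cursor.first():
            for i in range(N):
                cursor.value()[4*i:4*i+4]

def fast():
    # hoists cursor.value() out of the loop and steps the index by 4
    with env.begin(write=False, buffers=True) as txn:
        cursor = txn.cursor()
        if cursor.first():
            data = cursor.value()
            for i in range(0, N*4, 4):
                data[i:i+4]

for fn in (slow, fast):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")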

Upvotes: 1
