Reputation: 31
I'm seeing some rather improbable performance results from embedded Neo4j: on the surface it's orders of magnitude slower than expected, so I'm assuming I'm "doing it wrong", although I'm not doing anything complicated.
I'm using the latest embedded python bindings for Neo4j (https://github.com/neo4j/python-embedded)
from neo4j import GraphDatabase
db = GraphDatabase('/tmp/neo4j')
I've created 1500 fake products with simple attributes:
fake_products = [{'name':str(x)} for x in range(0,1500)]
... and created nodes out of them that I connected to a subreference node:
with db.transaction:
    products = db.node()
    db.reference_node.PRODUCTS(products)
    for prod_def in fake_products:
        product = db.node(name=prod_def['name'])
        product.INSTANCE_OF(products)
Now, with what looks to me like almost exactly the same kind of code I've seen in the documentation:
PRODUCTS = db.getNodeById(1)
for x in PRODUCTS.INSTANCE_OF.incoming:
    pass
... iterating through these 1500 nodes takes >0.2 s on my MacBook Pro. WHAT. (EDIT: I of course ran this query a number of times, so at least in the Python bindings it's not a matter of cold caches.)
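For what it's worth, here's the kind of harness I'd use to time repeated runs so the first (cold-cache) run can be compared against the later ones. This is a generic sketch; the `traverse` function shown in the comment is a hypothetical wrapper around the traversal from the question:

```python
import time

def time_runs(fn, runs=5):
    """Call fn several times and return per-run wall-clock timings.

    The first entry includes any one-off cost (cold caches, lazy
    initialization); later entries show steady-state performance.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# In the question's setup this would wrap the traversal, e.g.:
# def traverse():
#     for x in PRODUCTS.INSTANCE_OF.incoming:
#         pass
# print(time_runs(traverse))
```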
I amped it up to 15k and it took 2 s. I downloaded Gremlin and issued an equivalent query to check whether the problem is Neo4j itself or the Python bindings:
g.v(1).in("INSTANCE_OF")
... it took about 2 s on the first try; the second run completed almost immediately.
Any idea why it's so slow? The results I'm getting have got to be some kind of a mistake on my part.
Upvotes: 3
Views: 1407
Reputation: 6331
This is Neo4j loading data lazily and not doing any prefetching. On the first run, you are hitting the disk, on the second, the caches are warm, which is your real production scenario.
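The cold-vs-warm distinction can be illustrated with a toy analogy in plain Python, using `functools.lru_cache` to stand in for Neo4j's node cache. This is only an analogy, not Neo4j's actual cache implementation; the `time.sleep` simulates the disk latency paid on a cold read:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def load_node(node_id):
    # Simulate a slow disk read on the first access to each node.
    time.sleep(0.005)
    return {'id': node_id}

# Cold run: every lookup pays the "disk" cost.
start = time.perf_counter()
for i in range(100):
    load_node(i)
cold = time.perf_counter() - start

# Warm run: every lookup is served from the in-memory cache.
start = time.perf_counter()
for i in range(100):
    load_node(i)
warm = time.perf_counter() - start
```

Run this and `warm` comes out far smaller than `cold`, which is the same pattern seen in the Gremlin experiment above: a slow first query, then near-instant repeats once the caches are populated.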
Upvotes: 1