Reputation: 43
I have the following Cypher query:
MATCH (p1:`Article` {article_id:'1234'})--(a1:`Author` {name:'Jones, P'})
MATCH (p2:`Article` {article_id:'5678'})--(a2:`Author` {name:'Jones, P'})
MATCH (p1)-[:WRITTEN_BY]->(c1:`Author`)-[h1:HAS_NAME]->(l1)
MATCH (p2)-[:WRITTEN_BY]->(c2:`Author`)-[h2:HAS_NAME]->(l2)
WHERE l1=l2 AND c1<>a1 AND c2<>a2
RETURN c1.FullName, c2.FullName, h1.distance + h2.distance
On my local Neo4j server, this query takes ~4 seconds to run and PROFILE shows >3 million db hits. If I don't specify the Author label on c1 and c2 (it's redundant given the relationship types), the same query returns the same output in 33 ms, and PROFILE shows <200 db hits.
When I run the same two queries on a larger version of the same database that's hosted on a remote server, this difference in performance vanishes.
Both dbs have the same constraints and indexes. Any ideas what else might be going wrong?
Upvotes: 1
Views: 148
Reputation: 2507
Your query has a lot of unnecessary stuff in it, so first off, here's a cleaner version of it that is less likely to get misinterpreted by the planner:
MATCH (name:Name) WHERE NOT name.name = 'Jones, P'
WITH name
MATCH (:`Article` {article_id:'1234'})-[:WRITTEN_BY]->()-[h1:HAS_NAME]->(name)<-[h2:HAS_NAME]-()<-[:WRITTEN_BY]-(:`Article` {article_id:'5678'})
RETURN name.name, h1.distance + h2.distance
There's really only one path you want to find, and you want to find it for any author whose name is not Jones, P. Take advantage of your shared :Name nodes to start your query with the smallest set of definite points and expand paths from there. By stacking all those MATCH statements, you are generating a massive Cartesian product and only then filtering it down.
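If you still need the authors' full names in the output, as in the original RETURN, the same single-path idea can be sketched as follows. It assumes, as the original query suggests, that the author nodes carry name and FullName properties, and that filtering on c1.name and c2.name is equivalent to the c1<>a1 and c2<>a2 checks (that is, only one author per article is named 'Jones, P'). Treat it as a starting point rather than a verified rewrite.

```cypher
// Sketch only: property names (name, FullName, distance) are taken from the
// original query; the equivalence of the name filter is an assumption.
MATCH (:Article {article_id: '1234'})-[:WRITTEN_BY]->(c1)-[h1:HAS_NAME]->(l)
      <-[h2:HAS_NAME]-(c2)<-[:WRITTEN_BY]-(:Article {article_id: '5678'})
WHERE c1.name <> 'Jones, P' AND c2.name <> 'Jones, P'
RETURN c1.FullName, c2.FullName, h1.distance + h2.distance
```

One difference from the stacked-MATCH form: a single MATCH never traverses the same relationship twice, so h1 and h2 are always distinct relationships here. If the same co-author node appears on both articles, this pattern will not pair it with itself through one HAS_NAME relationship, whereas separate MATCH clauses would.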
As for the difference in query performance, it appears that the query planner is trying to use the Author label to build your 3rd and 4th paths. If you leave it out, the planner only touches the much narrower set of :Article nodes (fixed by the indexed property), then expands through the (incidentally very small) set of nodes that have -[:WRITTEN_BY]-> relationships, and then through the (also incidentally very small) set of those nodes that have a -[:HAS_NAME]-> relationship. That decision is based partly on the estimated size of the various sets, so if the server has a different number of :Author nodes, the planner can make a smarter choice there and not start from them.
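One way to test this explanation is to compare the label counts on the two servers and then profile a variant that keeps the label but tells the planner where to start. The sketch below assumes an index on :Article(article_id) exists (the question mentions indexes but does not list them). USING INDEX is a standard Cypher planner hint; the query itself is cut down to one of the two paths purely for illustration.

```cypher
// Compare the set sizes the planner's estimates are based on; run on both servers.
MATCH (a:Author) RETURN count(a) AS authors;
MATCH (p:Article) RETURN count(p) AS articles;

// Keep the :Author label, but hint that the plan should start from the
// Article index. Assumes an index on :Article(article_id).
PROFILE
MATCH (p1:Article)-[:WRITTEN_BY]->(c1:Author)-[h1:HAS_NAME]->(l1)
USING INDEX p1:Article(article_id)
WHERE p1.article_id = '1234'
RETURN c1.FullName, h1.distance;
```

If the db hits for the hinted query drop to roughly the level of the unlabeled version, that suggests the slow plan was starting from a scan of :Author nodes, which would be consistent with the explanation above.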
Upvotes: 2