Mehran
Mehran

Reputation: 16941

How to ask Neo4j to take cycles into account in an optimised way

This is a follow-up question to:

How to ask Neo4j to take cycles into account

In my previous question, @stdob-- kindly helped me find the following query:

MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)
OPTIONAL MATCH (n3t)-[:R]->(n4:R:L)
WHERE n3t = n3
RETURN labels(n1), labels(n3t), labels(n4);

The above query is a replacement for the following:

MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)-[:R]->(n4:R:L)
RETURN labels(n1), labels(n3t), labels(n4);

And I have to use the first one because in my data there's the possibility that n2 and n4 are the same nodes and since Neo4j refuses to take the same node twice, it will return null.

While the first query is valid and working, it has got a really bad performance. It forces the database to restart the search for the whole data and at the end, it will match the selected nodes using n3t = n3. Just to give you a hint on how bad its performance is, on a dataset of 200k magnitude, it takes 5 seconds to return the result while if I omit the second OPTIONAL MATCH and its WHERE the result is generated in less than 10 milliseconds for the same query. If anyone's interested, here's the execution plan for the query:

enter image description here

The right branch is the part I mentioned earlier (which I tried to fool Neo4j to take a node for the second time). As you can see 2M db hits were incurred in order to make Neo4j take a node for the second time. The actual query for this execution plan is:

PROFILE MATCH (n5_1:Revision:`Account`)<-[:RevisionOf]-(n5_2:Entity:`Account`) 
WITH n5_2, n5_1
ORDER BY n5_1.customer_number ASC
LIMIT 100
OPTIONAL MATCH (n5_1)-[:`Main Contact`]->(n4_1:Wrapper)<-[:Wrapper]-(:Revision:`Contact`)<-[:RevisionOf]-(n4_2:Entity:`Contact`) 
OPTIONAL MATCH (n4_4)-[:RevisionOf]->(n4_3:Revision:Latest:`Contact`:Active) 
WHERE (n4_2) = (n4_4) 
RETURN n5_1, n5_2, n4_1, n4_2, n4_3

So my question is, how can I write a Cypher query in which a node is taken for the second time while the performance is not gonna suffer?

For some example data and testbed, please go to the other question.

Upvotes: 1

Views: 39

Answers (2)

InverseFalcon
InverseFalcon

Reputation: 30417

I posted on your linked question an answer that should give the result table you described. If that fits what you're looking for, this query uses the same approach, and may be the solution for this question:

PROFILE MATCH (n5_1:Revision:`Account`)<-[:RevisionOf]-(n5_2:Entity:`Account`) 
WITH n5_2, n5_1
ORDER BY n5_1.customer_number ASC
LIMIT 100
OPTIONAL MATCH (n5_1)-[:`Main Contact`]->(n4_1:Wrapper)<-[:Wrapper]-(:Revision:`Contact`)<-[:RevisionOf]-(n4_2:Entity:`Contact`)
WHERE (n4_2)-[:RevisionOf]->(:Revision:Latest:`Contact`:Active)
OPTIONAL MATCH (n4_2)-[:RevisionOf]->(n4_3:Revision:Latest:`Contact`:Active) 
RETURN n5_1, n5_2, n4_1, n4_2, n4_3

This keeps the n4_2 in your last OPTIONAL MATCH, which should solve the performance issue you observed.

As you noted in your previous question, you want to avoid circumstances where the first OPTIONAL MATCH succeeds, but the second fails, leaving the variables as non-null from the first OPTIONAL MATCH when you don't want them to be.

We solve that issue by adding a WHERE after the first OPTIONAL MATCH, forcing the match to succeed only if the pattern you're looking for in the second OPTIONAL MATCH exists off of the last node (this will work even if such a pattern reuses relationships and nodes from the OPTIONAL MATCH).

Upvotes: 1

stdob--
stdob--

Reputation: 29167

You can try collect the tail in additionally:

PROFILE 
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)
WITH n1, [null] + ( (n3)-[:R]->(:R:L) ) as tail
WITH n1, tail, size(tail) as tailSize
UNWIND tail as t
WITH n1, tailSize, t WHERE (tailSize = 2 AND NOT t is NULL) OR tailSize = 1
WITH n1, nodes(t) as nds
WITH n1, nds[0] as n3t, nds[1] as n4
RETURN labels(n1), labels(n3t), labels(n4) 

Upvotes: 1

Related Questions