Reputation: 16941
This is a follow-up question to:
How to ask Neo4j to take cycles into account
In my previous question, @stdob-- kindly helped me find the following query:
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)
OPTIONAL MATCH (n3t)-[:R]->(n4:R:L)
WHERE n3t = n3
RETURN labels(n1), labels(n3t), labels(n4);
The above query is a replacement for the following:
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)-[:R]->(n4:R:L)
RETURN labels(n1), labels(n3t), labels(n4);
And I have to use the first one because in my data there's the possibility that n2
and n4
are the same nodes and since Neo4j refuses to take the same node twice, it will return null.
While the first query is valid and working, it has got a really bad performance. It forces the database to restart the search for the whole data and at the end, it will match the selected nodes using n3t = n3
. Just to give you a hint on how bad its performance is, on a dataset of 200k magnitude, it takes 5 seconds to return the result while if I omit the second OPTIONAL MATCH
and its WHERE
the result is generated in less than 10 milliseconds for the same query. If anyone's interested, here's the execution plan for the query:
The right branch is the part I mentioned earlier (which I tried to fool Neo4j to take a node for the second time). As you can see 2M db hits were incurred in order to make Neo4j take a node for the second time. The actual query for this execution plan is:
PROFILE MATCH (n5_1:Revision:`Account`)<-[:RevisionOf]-(n5_2:Entity:`Account`)
WITH n5_2, n5_1
ORDER BY n5_1.customer_number ASC
LIMIT 100
OPTIONAL MATCH (n5_1)-[:`Main Contact`]->(n4_1:Wrapper)<-[:Wrapper]-(:Revision:`Contact`)<-[:RevisionOf]-(n4_2:Entity:`Contact`)
OPTIONAL MATCH (n4_4)-[:RevisionOf]->(n4_3:Revision:Latest:`Contact`:Active)
WHERE (n4_2) = (n4_4)
RETURN n5_1, n5_2, n4_1, n4_2, n4_3
So my question is, how can I write a Cypher query in which a node is taken for the second time while the performance is not gonna suffer?
For some example data and testbed, please go to the other question.
Upvotes: 1
Views: 39
Reputation: 30417
I posted on your linked question an answer that should give the result table you described. If that fits what you're looking for, this query uses the same approach, and may be the solution for this question:
PROFILE MATCH (n5_1:Revision:`Account`)<-[:RevisionOf]-(n5_2:Entity:`Account`)
WITH n5_2, n5_1
ORDER BY n5_1.customer_number ASC
LIMIT 100
OPTIONAL MATCH (n5_1)-[:`Main Contact`]->(n4_1:Wrapper)<-[:Wrapper]-(:Revision:`Contact`)<-[:RevisionOf]-(n4_2:Entity:`Contact`)
WHERE (n4_2)-[:RevisionOf]->(:Revision:Latest:`Contact`:Active)
OPTIONAL MATCH (n4_2)-[:RevisionOf]->(n4_3:Revision:Latest:`Contact`:Active)
RETURN n5_1, n5_2, n4_1, n4_2, n4_3
This keeps the n4_2
in your last OPTIONAL MATCH, which should solve the performance issue you observed.
As you noted in your previous question, you want to avoid circumstances where the first OPTIONAL MATCH succeeds, but the second fails, leaving the variables as non-null from the first OPTIONAL MATCH when you don't want them to be.
We solve that issue by adding a WHERE after the first OPTIONAL MATCH, forcing the match to succeed only if the pattern you're looking for in the second OPTIONAL MATCH exists off of the last node (this will work even if such a pattern reuses relationships and nodes from the OPTIONAL MATCH).
Upvotes: 1
Reputation: 29167
You can try collect the tail in additionally:
PROFILE
MATCH (n1:S)
OPTIONAL MATCH (n1)-[:R]->(n2:R)<-[:R]-(n3:E)
WITH n1, [null] + ( (n3)-[:R]->(:R:L) ) as tail
WITH n1, tail, size(tail) as tailSize
UNWIND tail as t
WITH n1, tailSize, t WHERE (tailSize = 2 AND NOT t is NULL) OR tailSize = 1
WITH n1, nodes(t) as nds
WITH n1, nds[0] as n3t, nds[1] as n4
RETURN labels(n1), labels(n3t), labels(n4)
Upvotes: 1