Reputation: 817
I have imported data from a CSV file and created a lot of Nodes, all of which are related to other Nodes within the same data set based on a "Tree Number" hierarchy system:
For example, the Node with Tree Number A01.111 is a direct child of Node A01, and the Node with Tree Number A01.111.230 is a direct child of Node A01.111.
What I am trying to do is create unique relationships between Nodes that are direct children of other Nodes. For example Node A01.111.230 should only have one "IS_CHILD_OF" relationship, with Node A01.111.
I have tried several things, for example:
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (n <> n2)
AND NOT ((n2)-[:IS_CHILD_OF]->())
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
This example results in creating unique "IS_CHILD_OF" relationships but not with the direct parent of a Node. Rather, Node A01.111.230 would be related to Node A01.
Upvotes: 1
Views: 991
Reputation: 11705
I'd like to suggest another general solution, also avoiding a cartesian product as @InverseFalcon points out.
Let's indeed start by creating an index for faster lookup, and inserting some test data:
CREATE CONSTRAINT ON (n:Node) ASSERT n.treeNumber IS UNIQUE;
CREATE (n:Node {treeNumber: 'A01.111.230'})
CREATE (n:Node {treeNumber: 'A01.111'})
CREATE (n:Node {treeNumber: 'A01'})
Then we need to scan all nodes as potential parents, and look for children which start with the treeNumber
of the parent (STARTS WITH
can use the index) and have no dots in the "remainder" of the treeNumber
(i.e. a direct child), instead of splitting, joining, etc.:
MATCH (p:Node), (c:Node)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
I replaced the creation of the relationship by a simple RETURN
for profiling purposes, but you can simply replace it by CREATE UNIQUE
or MERGE
.
Actually, we can get rid of the p <> c
predicate and the + 1
on the length by pre-computing the actual prefix which should match:
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
However, profiling that query shows that the index is not used, and there is a cartesian product (so we have a O(n^2) algorithm):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 26 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) AND StartsWith(c.treeNumber,parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Apply | 2 | 9 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 9 | 9 | 12 | c | :Node |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 45
But, if we simple add a hint like so
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
it does use the index and we have something like a O(n*log(n)) algorithm (log(n) for the index lookup):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 6 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Apply | 2 | 3 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| | +NodeUniqueIndexSeekByRange | 9 | 3 | 6 | c | :Node(treeNumber STARTS WITH parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
Total database accesses: 19
Note that I did cheat a bit when introducing the WITH
step creating the prefix earlier, as I noticed it improved the execution plan and DB accesses over
MATCH (p:Node), (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
which has the following execution plan:
Compiler CYPHER 3.0
Planner RULE
Runtime INTERPRETED
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DB Hits | Variables | Other |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 9 | c, p | NOT(p == c) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +SchemaIndex | 6 | 12 | c -- p | PrefixSeekRangeExpression(p.treeNumber); :Node(treeNumber) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabel | 3 | 4 | p | :Node |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 25
Finally, for the record, the execution plan of the original query I wrote (i.e. without the hint) was:
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 21 | c, p | NOT(p == c) AND StartsWith(c.treeNumber,p.treeNumber) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +CartesianProduct | 9 | 9 | 0 | p -- c | |
| |\ +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 3 | 9 | 12 | c | :Node |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 37
It's not the worse one: the one without the hint but with the pre-computed prefix is! This is why you should always measure.
Upvotes: 3
Reputation: 30397
I think we can improve on the query a bit. First, ensure you have either a unique constraint or an index on :Node.treeNumber, as you'll need that to improve your parent node lookups in this query.
Next, let's match on child nodes, excluding root nodes (assuming no .'s in the root's treeNumber) and nodes that have already been processed and have a relationship already.
Then we'll find each node's parent by the treeNumber using our index, and create the relationship. This assumes that a child treeNumber always has 4 more characters, including the dot.
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SUBSTRING(child.treeNumber, 0, SIZE(child.treeNumber)-4) as parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
I think this query avoids a cartesian product as you may get from other answers, and should be around O(n) (someone correct me if I'm wrong).
EDIT
In the event that each subset of numbers in treeNumbers is NOT constrained to 3 (as in your description, actually, with 'A01.111.23'), then you need a different means of deriving the parentNumber. Neo4j is a little weak here, as it lacks both an indexOf() function as well as a join() function to reverse a split(). You may need the APOC Procedures library installed to allow access to a join() function.
The query to handle cases with variable counts of digits in the numeric subsets of treeNumber becomes this:
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SPLIT(child.treeNumber, '.') as splitNumber
CALL apoc.text.join(splitNumber[0..-1], '.') YIELD value AS parentNumber
WITH child, parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
Upvotes: 1
Reputation: 833
For your requirement STARTS WITH
won't work, since A01.111.23 does indeed start with A01 in addition to starting with A01.111.
The treeNumber
is made up of several parts with '.' as the separator. Let's not make any assumptions about the maximum/minimum possible character lengths of the individual parts. What we need is to compare all but the last part of each node's treeNumber
with that of the potential child node being tested. You can achieve this using Cypher's split()
function as follows:
MATCH (n1:Node), (n2:Node)
WHERE split(n2.treeNumber,'.')[0..-1] = split(n1.treeNumber,'.')
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n1);
The split()
function splits a string, at each occurrence of a given separator, into a list of strings (parts). In this context the separator is '.' to split any treeNumber
. We can select a subset of a list in cypher using the syntax list[{startIndex}..{endIndex}]
. Negative indices for reverse lookup are permitted, such ass the one used in the above query.
This solution should generalize to all possible treeNumber
values, in the format at hand, irrespective of number of parts and individual part lengths.
Upvotes: 0
Reputation: 817
I think I just figured out a solution! (If someone has a more elegant one please do post)
I just realized that the "Tree Number" coding system always uses 3-digit numbers between the dots, i.e. A01.111.230 or C02.100, therefore if a Node is the direct child of another Node, it's "Tree Number" should not only start with the Tree Number of the parent Node, it should also be 4 characters longer (one character for the dot '.' and 3 characters for the numeric value).
Therefore my solution that seems to do the job is:
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (length(n2.treeNumber) = (length(n.treeNumber) + 4))
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
Upvotes: 0