Peter Whitfield
Peter Whitfield

Reputation: 525

implementing a 'greedy' match to find the extent of a subtree in Cypher

I have a graph that contains many 'subtrees' of items where an original item can be cloned which results in

(clone:Item)-[:clones]->(original:Item)

and a cloned item can also be cloned:

(newclone:Item)-[:clones]->(clone:Item)

the first item is created by a user:

(:User)-[:created]->(:item)

and the clones are collected by a user:

(:User)-[:collected]->(:item)

Given any item in the tree, I want to be able to match all the items in the tree. I'm using:

(1)    match (known:Item)-[:clones*]-(others:Item)

My understanding is that this implements a 'greedy' match, traversing the tree in all directions, matching all items.

In general this works, however in some circumstances it doesn't seem to match all the items in the tree. For example, in the following query, this doesn't seem to be matching the whole subtree.

match p = (known:Item)-[r:clones*]-(others:Item) where not any(x in nodes(p) where (x)<-[:created]-(:User))  return p

Here I'm trying to find subtrees which are missing a 'created' Item (which were deleted in the source SQL database.

What I'm finding is that it giving me false positives because it's matching only part of a particular tree. For example, if there is a tree with 5 items structured properly as described above, it seems (in some cases) to be matching a subset of the tree (maybe 2 out of 5 items) and that subset doesn't contain the created card and so is returned by the query when I didn't expect it to.

Question Is my logic correct or am I misunderstanding something? I'm suspecting that I'm misunderstanding paths, but I'm confused by the fact that the basic 'greedy' match works in most cases.

I think that my problem is that I've been confused because the query is finding multiple paths in the tree, some of which satisfy the test in the query and some don't. When viewed in the neo4j visualisation, the multiple paths are consolidated into what looks like the whole tree whereas the tabular results show that the match (1) above actually gives multiple paths.

I'm now thinking that I should be using collections rather than paths for this.

Upvotes: 1

Views: 309

Answers (1)

jjaderberg
jjaderberg

Reputation: 9952

You are quite right that the query matches more paths than what is apparent in the browser visualization. The query is greedy in the sense that it has no upper bound for depth, but it also has no lower bound (well, strictly the lower bound is 1), which means it will emit a short path and a longer path that includes it if there are such. For data like

CREATE
(u)-[:CREATED]->(i)<-[:CLONES]-(c1)<-[:CLONES]-(c2)

the query will match paths

i<--c1
i<--c1<--c2
c1<--c2
c2-->c1
c2-->c1-->i
c1-->i

Of these paths, only the ones containing i will be filtered by the condition NOT x<-[:CREATED]-(), leaving paths

c1<--c2
c2-->c1

You need a further condition in your pattern before that filter, a condition such that each path that passes it should contain some node x where x<-[:CREATED]-(). That way that filter condition is unequivocal. Judging from the example model/data in your question, you could try matching all directed variable depth (clone)-[:CLONES]->(cloned) paths, where the last cloned does not itself clone anything. That last cloned should be a created item, so each path found can now be expected to contain a b<-[:CREATED]-(). That is, if created items don't clone anything, something like this should work

MATCH (a)-[:CLONES*]->(b)
WHERE NOT b-[:CLONES]->()
    AND NOT b<-[:CREATED]-()

This relies on only matching paths where a particular node in each path can be expected to be created. An alternative is to work on each whole tree by itself by getting a single pointer into the tree, and test the entire tree for any created item node. Then the problem with your query could be said to be that it treats c1<--c2 as if it's a full tree and he solution is a pattern that only matches once for a tree. You can then collect the nodes of the tree with the variable depth match from there. You can get such a pointer in different ways, easiest is perhaps to provide a discriminating property to find a specific node and collect all the items in that node's tree. Perhaps something like

MATCH (i {prop:"val"})-[:CLONES*]-(c)
WITH i, collect(distinct c) as cc
WHERE NOT (
    i<-[:CREATED]-() OR
    ANY (c IN cc WHERE c<-[:CREATED]-()
) //etc

This is not a generic query, however, since it only works on the one tree of the one node. If you have a property pattern that is unique per tree, you can use that. You can also model your data so that each tree has exactly one relationship to a containing 'forest'.

MATCH (forest)-[:TREE]->(tree)-->(item)-[:CLONES*]-(c) // etc

If your [:COLLECTED] or some other relationship, or a combination of relationships and properties make a unique pattern per tree, these can also be used.

Upvotes: 1

Related Questions