Dalamar42
Dalamar42

Reputation: 107

Given node A find all nodes in A's subgraph with linear time complexity in Neo4j

I am in the process of evaluating Neo4j for use in a production environment and I have encountered some difficulty when doing something that I expected to be simple. I have managed to solve it, but in a suboptimal and quite complicated way so I was thinking if there might be a simpler way to accomplish the same thing.

Preamble

Neo4j version 2.3.2

TL;DR

This is a bit of a long explanation so the summary is as follows:

Given node A, I need to find all nodes in the subgraph of A with a complexity of O(number_of_vertices + number_of_edges).

Problem

For our use case we have a graph that, with respect to one particular type of relationship, is broken up into smaller disconnected subgraphs of no more than a couple of dozen nodes each. What we are trying to accomplish is, given an indexed id from one of those nodes discover all nodes in the subgraph (while treating the graph as being undirected). The other peculiarity is that our nodes always have the reflexive edge from every node back to itself.

Algorithmically all we need is a breadth first search with an expected complexity of O(num_of_vertices + num_of_edges). Since our graphs are not dense the number of edges in them is roughly linear to the number of number of vertices so the overall complexity should be linear to the number of vertices in it.

Test graph

For simplicity I have made this test graph a fully connected one. Since the point is to compare cypher queries this does not affect the results.

Commands:

Simple query

The first query I tried to get the desired results is the following:

That query never terminated. When I added an upper bound for the relationship pattern and profiled the query I got the following:

So instead of being linear to the number of vertices and edges this looks like it's finding all possible paths which of course explodes with depth.

Better performing query

To accomplish this I wrote the following query instead:

https://gist.github.com/Dalamar42/1ec93cd74b01c145e7bd

(This will search to depth 2. Duplicate lines 6-16 to search to depth 4, 6, 8, and so on)

The query does the following:

  1. Get the entry node
  2. Add that node to a collection of nodes_found and another of nodes_to_visit
  3. For every node A in nodes_to_visit follow its edges to node B
  4. From nodes B keep only nodes that are not in nodes_found as nodes C
  5. Set nodes_to_visit to nodes C and set nodes_found to its previous value plus nodes C
  6. Repeat to the desired depth

This query should almost have the complexity of BFS except for one complexity. From what I understand every intermediate MATCH/WHERE needs to match at least one node otherwise cypher returns empty disregarding nodes found in the previous steps. I worked around this by changing step 4 to:

"From nodes B keep only node A and nodes that are not in nodes_found as nodes C"

Because all nodes have the reflexive edge, node A will always be in the set of nodes B and by always keeping it I make sure that that part of the query will always match at least one node.

This means this query has the following problems:

The upside of this query is that I am getting much better performance

Better solution? Does anyone know of a better/simpler solution? Am I missing some obvious feature of Cypher? I wasn't able to find anything going through the documentation.

Thanks

Upvotes: 2

Views: 1141

Answers (2)

InverseFalcon
InverseFalcon

Reputation: 30417

Cybersam's answer is fantastic, and performs fairly well. However, with APOC Procedures, there is an alternative that seems to perform even faster on large graphs (tested against a fully-interconnected movies graph).

APOC's path expander allows a bfs approach, and when using NODE_GLOBAL uniqueness, nodes are only ever visited once. This also allows for a much more concise query.

MATCH (a:Label { id:1 })
USING INDEX a:Label(id)
CALL apoc.path.expandConfig(a,{relationshipFilter:'REL', bfs:true, uniqueness:"NODE_GLOBAL"}) 
  YIELD path
WITH a, LAST(NODES(path)) as b
RETURN b

Upvotes: 1

cybersam
cybersam

Reputation: 67044

[UPDATED]

Here is a simpler approach that is similar to yours, but it uses the COALESCE function to avoid having to artificially add the "entry node" into every "rim node" collection. (By "rim node", I mean a previously unencountered node that was found in the most recent match.)

The query assumes you have created an index on :Label(id) to speed up the first MATCH. The top section is just for getting the "entry node" and initializing the "res" (or result) and "rim" collections. The subsequent sections are just exact copies of each other, and can be repeated to match the desired depth of your search.

With a depth of 2, as shown here, only 40 db hits are expended.

Note1: Given your test data, only a depth of 1 is needed. In that case, only 16 DB hits are expended.

Note2: The 3rd WITH clause in each section is there to force a NULL into an empty rim collection. This is because UNWIND will abort the query if it is asked to unwind an empty collection. For some reason, passing [NULL] instead of [] to COALLESCE() does not work as expected.

MATCH (a:Label { id:1 })
USING INDEX a:Label(id)
WITH COLLECT(a) AS res
WITH res, res AS rim

UNWIND rim AS a
OPTIONAL MATCH (a)-[:REL]-(b:Label)
WHERE NOT b IN res
WITH res, COALESCE(COLLECT(DISTINCT b),[]) AS rim
WITH rim, res + rim AS res
WITH res, CASE rim WHEN [] THEN [NULL] ELSE rim END AS rim

UNWIND rim AS a
OPTIONAL MATCH (a)-[:REL]-(b:Label)
WHERE NOT b IN res
WITH res, COALESCE(COLLECT(DISTINCT b),[]) AS rim
WITH rim, res + rim AS res
WITH res, CASE rim WHEN [] THEN [NULL] ELSE rim END AS rim

RETURN res;

Upvotes: 2

Related Questions