J. Gambolputty
J. Gambolputty

Reputation: 53

Performance of the transitive closure query in Neo4j

I am trying to compute the transitive closure of an undirected graph in Neo4j using the following Cypher Query ("E" is the label that every edge of the graph has):

MATCH (a) -[:E*]- (b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b

I tried to execute this query on a graph with 10k nodes and around 150k edges, but even after 8 hours it did not finish. I find this surprising, because even the most naive SQL solutions are much faster and I expected that Neo4j would be much more efficient for these kind of standard graph queries. So is there something that I am missing, maybe some tuning of the Neo4j server or a better way to write the query?

Edit

Here is the result of EXPLAINing the above query:

+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
908 ms

Compiler CYPHER 3.3

Planner COST

Runtime INTERPRETED

+-----------------------+----------------+------------------+--------------------------------+
| Operator              | Estimated Rows | Variables        | Other                          |
+-----------------------+----------------+------------------+--------------------------------+
| +ProduceResults       |          14069 | a, b             |                                |
| |                     +----------------+------------------+--------------------------------+
| +Distinct             |          14069 | a, b             | a, b                           |
| |                     +----------------+------------------+--------------------------------+
| +Filter               |          14809 | anon[11], a, b   | ID(a) < ID(b)                  |
| |                     +----------------+------------------+--------------------------------+
| +VarLengthExpand(All) |          49364 | anon[11], b -- a | (a)-[:E*]-(b)                  |
| |                     +----------------+------------------+--------------------------------+
| +AllNodesScan         |          40012 | a                |                                |
+-----------------------+----------------+------------------+--------------------------------+

Total database accesses: ?

Upvotes: 0

Views: 671

Answers (1)

Rebecca Nelson
Rebecca Nelson

Reputation: 1296

You can limit the direction, but it requires the graph to be directed.

After doing some testing and profiling of my own, I found that for even very small sets of data (Randomly-generated sets of 10 nodes with 2 random edges on each), making the query be only for a single direction cut down on database hits by a factor of 10000 (from 2266909 to 149 database hits).

Adding a direction to your query (and thus forcing the graph to be directed) cuts down the search space by a great deal, but it requires the graph to be directed.

I also tried simply adding a reverse relationship for each directed one, to see if that would have similar performance. It did not; it did not complete before 5 minutes had passed, at which point I killed it.

Unfortunately, you are not doing anything wrong, but your query is massive.

Neo4J being a graph database does not mean that all mathematical operations involving graphs will be extremely fast; they are still subject to performance constraints, up to and including the transitive closure operation.

The query you have written is an unbounded path search for every single pair of nodes. The node pairs are bounded, but not in a very meaningful way (the bound of ID(a) < ID(b) just means that the search only needs to be done one way; there are still 10k! (as in factorial) possible sets of nodes in the result set.

And then, that's only after every single path is checked. Searching for the entire transitive closure of a graph the size that you specified will be extremely expensive performance-wise.

The SQL that you posted is not performing the same operation.

You mentioned in the comments that you tried this query in a relational table in a recursive form:

WITH RECURSIVE temp_tc AS (
  SELECT v AS a, v AS b FROM nodes
  UNION SELECT a,b FROM edges g
  UNION SELECT t.a,g.b FROM temp_tc t, edges g WHERE t.b = g.a
)
SELECT a, b FROM temp_tc;

I should note that this query is not performing the same thing that Neo4J does when it tries to find all paths. Before Neo4J can start to pare down your results, it must generate a result set that consists of every single path in the entire graph.

The SQL and relational query does not do that; it starts from the list of links, but that recursive query has the effect of removing any potential duplicate links; it discovers other links as its searching for the links of others; e.g. if the graph looks like (A)-(B)-(C), that query will find that B connects to C in the process of finding that A connects to C.

With the Neo4J, every path must be discovered separately.

If this is your general use-case, it is possible that Neo4J is not a good choice if speed is a concern.

Upvotes: 0

Related Questions