Reputation: 8913
I have created a very large directed, weighted graph, and I'm trying to find the widest path between two points. Each edge has a count property.
Here is a small portion of the graph (shown as an image in the original post).
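As a purely illustrative stand-in for that image, a tiny graph of the same shape could be created like so (all names and counts here are made up):

CREATE (entry:Vertex {name:'ENTRY'}),
       (a:Vertex {name:'A'}),
       (b:Vertex {name:'B'}),
       (exit:Vertex {name:'EXIT'}),
       (entry)-[:TRAVELED {count: 30}]->(a),
       (a)-[:TRAVELED {count: 5}]->(exit),
       (entry)-[:TRAVELED {count: 10}]->(b),
       (b)-[:TRAVELED {count: 20}]->(exit)

In this miniature, the widest path from ENTRY to EXIT would go through B: its narrowest edge has count 10, versus 5 for the path through A.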
I found this example and modified the query so that the path matching is directional, like so:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, EXTRACT(c IN RELATIONSHIPS(p) | c.count) AS counts
UNWIND counts AS b
WITH p, MIN(b) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1
This query seems to require an enormous amount of memory, and fails even on partial data.
Update: for clarification, the query runs until it runs out of memory.
I've found this link, which combines the use of Spark and Neo4j. Unfortunately, Mazerunner for Neo4j does not support the "widest path" algorithm out of the box. What would be the right approach to running the "widest path" query on a very large graph?
Upvotes: 2
Views: 630
Reputation: 67019
Some thoughts.
Your query (and the original example query) can be simplified. This may or may not be sufficient to prevent your memory issue.
For each matched path, there is no need to: (a) create a collection of counts, (b) UNWIND it into rows, and then (c) perform a MIN aggregation. The same result could be obtained by using the REDUCE function instead:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1;
(I assume that the count property value is an int; 2147483647 is the max int value.)
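For intuition, the REDUCE expression is just a running minimum. A standalone example over a literal list (the values are made up):

RETURN REDUCE(m = 2147483647, x IN [7, 3, 9] | CASE WHEN x < m THEN x ELSE m END) AS width

This returns 3, the smallest value in the list, just as the query above produces the smallest count along each path.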
You should create an index (or, perhaps more appropriately, a uniqueness constraint) on the name property of the Vertex label. For example:
CREATE INDEX ON :Vertex(name)
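If every Vertex name is unique, the uniqueness-constraint form (which also creates the backing index) would be, in the Cypher syntax of that era:

CREATE CONSTRAINT ON (v:Vertex) ASSERT v.name IS UNIQUE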
EDITED
This enhanced version of the above query might solve your memory problem:
MERGE (t:Temp) SET t.count = 0, t.widest_path = NULL
WITH t
OPTIONAL MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH t, p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
WHERE count > t.count
SET t.count = count, t.widest_path = EXTRACT(n IN NODES(p) | n.name)
WITH COLLECT(DISTINCT t)[0] AS t
WITH t, t.count AS count, t.widest_path AS `Widest Path`
DELETE t
RETURN `Widest Path`, count;
It creates (and ultimately deletes) a temporary :Temp node to keep track of the currently "winning" count and the names of the nodes on the corresponding path. (Node objects cannot themselves be stored as properties, which is why the query stores the name values instead. You must also make sure that the label Temp is not otherwise used.)
The WITH clause starting with COLLECT(DISTINCT t) uses aggregation of the distinct :Temp nodes (of which there is only one) to ensure that Cypher keeps just a single reference to the :Temp node, no matter how many paths satisfy the WHERE clause. Also, that WITH clause does NOT include p, so Cypher does not accumulate paths that we do not care about. It is this clause that might be the most important in helping to avoid your memory issues.
I have not tried this out.
Upvotes: 2
Reputation: 18022
The reason your algorithm is taking a long time to run is that (a) you have a big graph, (b) your memory parameters probably need tweaking (see comments), and (c) you're enumerating every possible path between ENTRY and EXIT. Depending on how your graph is structured, this could be a huge number of paths (in the worst case, exponentially many).
Note that the "width" of a path is the smallest count on any of its edges, and the widest path is the one that maximizes that minimum. This means that you're probably computing and re-computing many paths you could safely ignore.
Wikipedia has good information on this algorithm you should consider. In particular:
It is possible to find maximum-capacity paths and minimax paths with a single source and single destination very efficiently even in models of computation that allow only comparisons of the input graph's edge weights and not arithmetic on them.[12][18] The algorithm maintains a set S of edges that are known to contain the bottleneck edge of the optimal path; initially, S is just the set of all m edges of the graph. At each iteration of the algorithm, it splits S into an ordered sequence of subsets S1, S2, ... of approximately equal size; the number of subsets in this partition is chosen in such a way that all of the split points between subsets can be found by repeated median-finding in time O(m). The algorithm then reweights each edge of the graph by the index of the subset containing the edge, and uses the modified Dijkstra algorithm on the reweighted graph; based on the results of this computation, it can determine in linear time which of the subsets contains the bottleneck edge weight. It then replaces S by the subset Si that it has determined to contain the bottleneck weight, and starts the next iteration with this new set S. The number of subsets into which S can be split increases exponentially with each step, so the number of iterations is proportional to the iterated logarithm function, O(log* n), and the total time is O(m log* n).[18] In a model of computation where each edge weight is a machine integer, the use of repeated bisection in this algorithm can be replaced by a list-splitting technique of Han & Thorup (2002), allowing S to be split into O(√m) smaller sets Si in a single step and leading to a linear overall time bound.
You should consider implementing this approach in place of your current "enumerate all paths" Cypher query, as the "enumerate all paths" approach has you re-checking the same edge counts for as many paths as there are that involve that particular edge.
There's no ready-made software that will just do this for you; I'd recommend taking that paragraph (and checking its citations for further information) and implementing it yourself. Performance-wise, I think you can do much better than your current query.
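As one concrete, comparison-only direction (a sketch in the same spirit as the quoted approach, not the full algorithm): binary-search the bottleneck value over the sorted distinct count values, issuing one Cypher reachability probe per candidate threshold. Here {w} is an old-style query parameter supplied by your client code:

MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WHERE ALL(r IN RELATIONSHIPS(p) | r.count >= {w})
RETURN true AS reachable
LIMIT 1

If the probe returns a row, a path of width at least w exists, so the search moves to higher thresholds; otherwise it moves lower. The largest feasible threshold is the width of the widest path. A successful probe stops at the first qualifying path thanks to LIMIT 1, although an infeasible threshold can still force a long traversal.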
Upvotes: 4