Neo4j: How to return a single path for each pair of nodes that have multiple relationships

Question

Assuming a graph like this:

(Thanks to https://neo4j.com/blog/neo4j-2-0-ga-graphs-for-everyone/ )

(Not shown but assume all countries, all artists, and all recording contracts are in the graph)

What would the Cypher be for:

1. Starting with United Kingdom, return one path for each country where there is at least one recording contract
   - It doesn't matter which path is returned, just that it's a single path
   - Should return (United Kingdom)<-[]-(Iron Maiden)-[]->(Epic)-[]->(United States), but not (United Kingdom)<-[]-(Hybrid Theory)-[]->(Mad Decent)-[]->(United States) or (United Kingdom)<-[]-(Iron Maiden)-[]->(Columbia)-[]->(United States), for example
2. Return a single path for each of any two countries that are connected
   - Should return one path for (United Kingdom)-[]-(United States), one for (Japan)-[]-(Canada), etc. Bonus points for LIMIT 20 limiting it to either 20 paths or 20 country nodes
   - Also does not matter which path is returned, just that it's a single path

Edit: I've tried various combinations of MATCH (c1:Country)-[]-(c2:Country), MATCH p=((c1:Country)-[]-(c2:Country)), WITH, and UNWIND. I've also tried to use FOREACH to return only one path, but can't quite get the formula right.

InverseFalcon · Accepted Answer

This is easier if you are using subqueries (Neo4j 4.1.x or higher). A subquery scopes the work you need to do (the expansion and the collect(), in this case) to a single country at a time, instead of performing it across all rows for the entire query, which could stress the heap.

In reality, since the number of countries is low, it won't be a problem, but it's a good approach to use when dealing with larger sets of nodes.

MATCH (country:Country)
CALL {
 WITH country
 MATCH path = (country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
 WHERE id(country) < id(other)
 RETURN other, collect(path)[0] as path
 LIMIT 20
}
RETURN country, path
LIMIT 20

Let's look at what this is doing. We MATCH to :Country nodes.

Per country we will MATCH to the pattern you're looking for. If these are the only such paths and labels in the graph, then you can omit the labels in the pattern, as the relationship types should be enough to find the correct nodes.
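As a rough sketch of that label-free variant, assuming FROM_AREA and RECORDING_CONTRACT are used nowhere else in your graph (if they are, keep the labels):

MATCH (country:Country)
CALL {
 WITH country
 // No labels on the intermediate or end nodes; the relationship types alone select them
 MATCH path = (country)<-[:FROM_AREA]-()-[:RECORDING_CONTRACT]->()-[:FROM_AREA]->(other)
 WHERE id(country) < id(other)
 RETURN other, collect(path)[0] as path
 LIMIT 20
}
RETURN country, path
LIMIT 20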

The WHERE id(country) < id(other) is here to prevent mirrored results. For example, in the course of the query if we find a path from (United Kingdom)-[*]-(United States), and we also find a path the other direction, for (United States)-[*]-(United Kingdom), you probably don't want to return both. So we place a restriction on the graph ids so that only one of these will meet the restriction, and the mirrored result gets filtered out.
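One caveat if you are on Neo4j 5 or later: id() is deprecated there in favor of elementId(), which returns a string rather than an integer. Any consistent ordering works for breaking the tie, so a string comparison should do the same job. A sketch of the query with only that line swapped:

MATCH (country:Country)
CALL {
 WITH country
 MATCH path = (country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
 // String comparison; keeps only one orientation of each country pair
 WHERE elementId(country) < elementId(other)
 RETURN other, collect(path)[0] as path
 LIMIT 20
}
RETURN country, path
LIMIT 20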

We use RETURN other, collect(path)[0] as path to get a single path per pair of country and other nodes. Remember that this is happening inside a subquery being called per country node, so even though country does not appear in this RETURN, the operation is being performed for one specific country node.

When we aggregate (such as with this collect(path)), the grouping key (the non-aggregation variables in the same clause) becomes distinct. So for the country and the other country, this collects all the paths between them and then takes the first element of that list, which gives us a single path between two distinct countries.
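If it helps to see the grouping spelled out, here is a sketch of the same aggregation without the subquery, with both countries listed explicitly as the grouping key. The pathCount column is only there for illustration, and note that this form aggregates across all rows at once, which is the heap concern mentioned earlier for larger graphs:

MATCH path = (country:Country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
WHERE id(country) < id(other)
// country and other are the grouping key; all paths for one pair land in one list
WITH country, other, collect(path) AS paths
// head(paths) is equivalent to paths[0]
RETURN country, other, size(paths) AS pathCount, head(paths) AS path
LIMIT 20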

We LIMIT the subquery results to 20, since we know in total we don't want more than 20 paths, so per country we don't want more than 20 paths either. This might be a bit redundant for this case, but when the query is more complex it is the right approach to make sure you're not doing more work than is needed.

We also have another LIMIT outside the subquery, so that if there are only a few countries processed, with a few paths per country, the total paths won't exceed 20.
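For the first part of the question (starting from United Kingdom only), the same idea should work without a subquery, since there is just one start node. A sketch, assuming your :Country nodes carry a name property (that property name is a guess on my part, adjust it to your model):

// `name` is an assumed property; use whatever identifies the country in your graph
MATCH (uk:Country {name: 'United Kingdom'})
MATCH path = (uk)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
// Drop this line if a path from United Kingdom back to itself should count
WHERE other <> uk
// One row per other country, with an arbitrary single path to it
RETURN other, head(collect(path)) AS path

There's no mirrored-result problem here because the start node is fixed, so the id() comparison isn't needed.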
