Reputation: 621
Assuming a graph like this:
(Thanks to https://neo4j.com/blog/neo4j-2-0-ga-graphs-for-everyone/ )
(Not shown but assume all countries, all artists, and all recording contracts are in the graph)
What would the CYPHER be for:
(United Kingdom)<-[]-(Iron Maiden)-[]->(Epic)-[]->(United States), but not (United Kingdom)<-[]-(Hybrid Theory)-[]->(Mad Decent)-[]->(United States) or (United Kingdom)<-[]-(Iron Maiden)-[]->(Columbia)-[]->(United States), for example (United Kingdom)-[]-(United States), one for (Japan)-[]-(Canada), etc. Bonus points for LIMIT 20 limiting it to either 20 paths or 20 country nodes.
Edit: I've tried various combinations of MATCH (c1:Country)-[]-(c2:Country), MATCH p=((c1:Country)-[]-(c2:Country)), WITH, and UNWIND. I've also tried to use FOREACH to return only one path, but can't quite get the formula right.
Upvotes: 0
Views: 1392
Reputation: 30397
This is easier if you are using subqueries (Neo4j 4.1.x or higher). That's because the subquery scopes the operations you need to perform (collect(), in this case) to the expansions from a single country, one country at a time, instead of performing them across all rows of the entire query, which could stress the heap.
In reality, since the number of countries is low, it won't be a problem, but it's a good approach to use when dealing with larger sets of nodes.
MATCH (country:Country)
CALL {
  // Runs once per country row
  WITH country
  MATCH path = (country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
  // Keep only one orientation of each country pair
  WHERE id(country) < id(other)
  // One path per (country, other) pair
  RETURN other, collect(path)[0] AS path
  LIMIT 20
}
RETURN country, path
LIMIT 20
Let's look at what this is doing. We MATCH to :Country nodes.
Per country we will MATCH to the pattern you're looking for. If these are the only such paths and labels in the graph, then you can omit the labels in the pattern, as the relationship types should be enough to find the correct nodes.
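As a rough sketch of what omitting the labels could look like: this assumes that FROM_AREA and RECORDING_CONTRACT are only ever used between the node types shown in your graph, which I can't verify from the picture alone.

```cypher
MATCH (country:Country)
CALL {
  WITH country
  // Labels dropped from the intermediate and end nodes; this relies on the
  // relationship types alone to reach artists, labels, and countries.
  MATCH path = (country)<-[:FROM_AREA]-()-[:RECORDING_CONTRACT]->()-[:FROM_AREA]->(other)
  WHERE id(country) < id(other)
  RETURN other, collect(path)[0] AS path
  LIMIT 20
}
RETURN country, path
LIMIT 20
```

If other kinds of nodes can also have FROM_AREA relationships, keep the labels so the pattern doesn't pick up unintended paths.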
The WHERE id(country) < id(other) is here to prevent mirrored results. For example, if in the course of the query we find a path for (United Kingdom)-[*]-(United States), and we also find a path in the other direction, for (United States)-[*]-(United Kingdom), you probably don't want to return both. So we place a restriction on the graph ids so that only one of these will meet the restriction, and the mirrored result gets filtered out.
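One caveat worth checking against your data: the pattern is directed (the artist's country is on the left, the label's country on the right), so the id restriction keeps only the orientation in which the artist's country has the lower id. If a pair of countries only has paths in the other orientation, that pair is dropped. If you want one path per unordered pair regardless of which side the artist is on, a possible variation is to build a normalised pair key and aggregate on that. This is a sketch only; the names a, b, and pairKey are made up for illustration.

```cypher
MATCH path = (a:Country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(b:Country)
WHERE a <> b
// Order the two ids so that (UK, US) and (US, UK) produce the same key.
WITH CASE WHEN id(a) < id(b) THEN [id(a), id(b)] ELSE [id(b), id(a)] END AS pairKey,
     collect(path)[0] AS path
RETURN path
LIMIT 20
```

Here the aggregation runs over all rows at once rather than per country, so it trades the subquery's scoping for simpler pair handling. On Neo4j 5 and later, where id() is deprecated, elementId() should be usable in the same comparisons.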
We use RETURN other, collect(path)[0] as path to get a single path per pair of country and other nodes. Remember that this is happening inside a subquery being called per country node, so even though country is not present here, this operation is being performed for a specific country node.
When we aggregate (such as with this collect(path)), the grouping key (usually the non-aggregation variables) becomes distinct. So for the country and the other country, this will collect all the paths between them and then take the first of that list of paths, which gives us our single path between two distinct countries.
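For comparison, here is roughly what the same aggregation looks like without the subquery, with the grouping keys spelled out explicitly. Treat it as an illustration of the semantics rather than a recommendation, since collect() here runs across all rows of the query at once.

```cypher
MATCH path = (country:Country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
WHERE id(country) < id(other)
// country and other are the grouping keys; collect(path) is evaluated
// once per distinct (country, other) combination.
WITH country, other, collect(path)[0] AS path
RETURN country, other, path
LIMIT 20
```

Each returned row is one distinct (country, other) pair together with the first path collected for it.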
We LIMIT the subquery results to 20, since we know in total we don't want more than 20 paths, so per country we don't want more than 20 paths either. This might be a bit redundant for this case, but when the query is more complex it is the right approach to make sure you're not doing more work than is needed.
We also have another LIMIT outside the subquery, so that if there are only a few countries processed, with a few paths per country, the total paths won't exceed 20.
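If you would rather cap the number of starting countries than the number of paths, one option is to apply the LIMIT before the subquery. This is a sketch that assumes "20 country nodes" means 20 countries on the left-hand side of the pattern, which may not be exactly what you had in mind.

```cypher
MATCH (country:Country)
// Cap the starting countries instead of the resulting paths.
WITH country
LIMIT 20
CALL {
  WITH country
  MATCH path = (country)<-[:FROM_AREA]-(:Artist)-[:RECORDING_CONTRACT]->(:Label)-[:FROM_AREA]->(other:Country)
  WHERE id(country) < id(other)
  RETURN other, collect(path)[0] AS path
}
RETURN country, other, path
```

Countries for which the subquery finds no matching path produce no rows, so fewer than 20 countries may show up in the output, and the total number of paths is no longer bounded at 20.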
Upvotes: 3