How do I return a list of any nodes that meet certain criteria along with values available only if a criterion is met?

Question

I want to get a list of countries from my data set. Assume the org I'm writing this for "operates" in a variety of countries. However, the definition of operate can be framed in a few ways.

First, if there's a person who works for the company sitting in that country.

Second, if there's any awarded business being supported in that country (the award is known to be in the country if a task within the award is attached to the country)

Third, if the organization is specifically licensed to do business there. The company might be licensed, but not actually doing any work there currently, for instance.

So, I want to get some json out the other end that I can use to power interactions with an svg country map. That means I either need a json list of the countries that meet any of these criteria, along with some information about which of the criteria each country meets (e.g.: {Country:"The United State", Code:"USA", hasPeople:true, hasActiveAward: false, isLicensed:true}) or I need three separate lists (one for each criteria) with a subset of the countries in it.

I've tried a variety of things, including this one below, and this one fails most spectacularly while also being the simplest to understand by simply reading.

// get all types of country
match (c1)<-[:SITS_IN|GEO_IN*]-(p:Person)
match (c2)<-[:TASK_IN_COUNTRY]-(:Task)-[:PART_OF*2]->(a:Award {awardStatus:"ACTIVE"})
match (c3)<-[:LICENSED_IN]-(d:Department)

return {
    peopleCountries:[{country:c1.name, code:c1.code, people:count(p)}],
    awardCountries:[{country:c2.name, code:c2.code, awards:count(a)}],
    licensedCountries:[{country:c3.name, code:c3.code, department:d.name}]
    }

This is another one I tried that actually works okay, but still takes about 14 seconds to run:

match (c:Country)
where exists((c)<-[:SITS_IN|GEO_IN*]-(:Person)) or
exists((c)<-[:TASK_IN_COUNTRY]-(:Task)-[:PART_OF*2]->(:Award {awardStatus:"ACTIVE"}))
or exists((c)<-[:LICENSED_IN]-(:Department))
with c
optional match (c)<-[:SITS_IN|GEO_IN*]-(p:Person)
optional match (c)<-[:TASK_IN_COUNTRY]-(:Task)-[:PART_OF*2]->(a:Award {awardStatus:"ACTIVE"})
optional match (c)<-[:LICENSED_IN]-(d:Department)
with c, count(p) as people, count(a) as awards, d.name as department
return {
    country:c.name,
    people:people,
    awards:awards,
    department:department
}

Any help much appreciated. I sort of think I should be either separating these out into three entirely separate requests, or using path variables, but I'm still just getting started on Cypher, so I'm open to learning better ways to think about this.

Thanks!

InverseFalcon · Accepted Answer

One thing to be careful about are multiple matches or optional matches in a row where your intent is to aggregate results later. Keep in mind that Neo4j builds up results as rows, and processes operations (matches and optional matches) on those rows, even if it would be redundant.

For example, taking your first two optional matches:

with c
optional match (c)<-[:SITS_IN|GEO_IN*]-(p:Person)
optional match (c)<-[:TASK_IN_COUNTRY]-(:Task)-[:PART_OF*2]->(a:Award {awardStatus:"ACTIVE"})

Let's say for a country, say France, there are 50 people as a result of the first optional match. For France alone, we now have 50 rows: France paired with each person that matched. Now we run the second optional match, and even though you intend it to only execute once per country, it's executing once per row, so it executes 50 times on France for each row of France with a person from the first match.

And so on. If 3 awards were found for France, you now have 150 rows for France, a row for each combination of France, one of the persons who matched, and one of the awards that matched. And now the last optional match has to run on those 150 rows. And this is just theoretical for a single country when there are likely multiple countries you are processing.

To avoid all this, it's best to process the aggregations of each optional match immediately, using WITH between them, so after each WITH the number of rows stays constant as one per country. That fits with your optional matches, since all of them match from a country, and your intent is for them to run once per country, not multiple times. An improved query might look like:

match (c:Country)
where exists((c)<-[:LICENSED_IN]-(:Department))
or exists((c)<-[:SITS_IN|GEO_IN*]-(:Person)) 
or exists((c)<-[:TASK_IN_COUNTRY]-(:Task)-[:PART_OF*2]->(:Award {awardStatus:"ACTIVE"}))
with c
optional match (c)<-[:SITS_IN|GEO_IN*]-(p:Person)
with c, count(distinct p) as people
optional match (c)<-[:TASK_IN_COUNTRY]-(:Task)-[:PART_OF*2]->(a:Award {awardStatus:"ACTIVE"})
with c, people, count(distinct a) as awards
optional match (c)<-[:LICENSED_IN]-(d:Department)
with c, people, awards, collect(d.name) as departments
return {
    country:c.name,
    people:people,
    awards:awards,
    departments:departments
}

I reordered your first match predicates a little bit so the easier and quicker comparisons happen first, that will save you from executing the more complicated predicates needlessly.

I also collected departments that are licensed in the country (I'm assuming multiple departments can be licensed, not just a single department).

The output should be a single row per country.

How do I return a list of any nodes that meet certain criteria along with values available only if a criterion is met?

Answers (1)

Related Questions