Performance in Neo4j cypher query

Question

I have the following Cypher query:

MATCH (country:Country { name: 'norway' }) <- [:LIVES_IN] - (person:Person)
WITH person
MATCH (skill:Skill { name: 'java' }) <- [:HAS_SKILL] - (person)
WITH person 
OPTIONAL MATCH (skill:Skill { name: 'javascript' }) <- [rel:HAS_SKILL] - (person)
WITH person, CASE WHEN skill IS NOT NULL THEN 1 ELSE 0 END as matches 
ORDER BY matches DESC 
LIMIT 50 
RETURN COLLECT(ID(person)) as personIDs

It seems to perform worse as I add more nodes. Right now, with only 5000 Person nodes (a Person node can have multiple HAS_SKILL relationships to Skill nodes), the query takes around 180 ms, and adding another 1000 Person nodes with relationships adds 30-40 ms. We are planning on having millions of Person nodes, so adding 40 ms for every 1000 Person nodes is a no-go.

I use parameters in my query instead of 'norway', 'java', 'javascript' in the above query. I have created indexes on :Country(name) and :Skill(name).

My goal with the query is to match every person who lives in a specified country (norway) and also has the skill 'java'. If the person also has the skill 'javascript', they should be ranked higher in the result.

How can I restructure the query to improve performance?

Edit:

There also seems to be an issue with the :Country nodes. If I replace

MATCH (country:Country { name: 'norway' }) <- [:LIVES_IN] - (person:Person)

with

MATCH (city:City { name: 'vancouver' }) <- [:LIVES_IN] - (person:Person)

the query time drops to around 15-50 ms, depending on which city I query for. There is still a noticeable increase in query time when adding more nodes.

Edit 2:

It seems like the query time increases by a huge amount when there are a lot of rows in the first MATCH clause. If I change the query to match on Skill nodes first, the query time decreases substantially. The query is part of an API and is created dynamically, so I do not know which of the MATCH clauses will return the fewest rows. There will probably also be a lot more rows in every MATCH clause when the database grows.

Edit 3:

I have done some testing from the answers and I now have the following query:

MATCH (country:Country { name: 'norway'}) 
WITH country 
MATCH (country) <- [:LIVES_IN] - (person:Person) 
WITH person 
MATCH (person) - [:HAS_SKILL] -> (skill:Skill) WHERE skill.name = 'java' 
MATCH (person) - [:MEMBER_OF_GROUP] -> (group:Group) WHERE group.name = 'some_group_name' 
RETURN DISTINCT ID(person) as id 
LIMIT 50

This still has performance issues. Is it maybe better to first match all the skills etc., as with the Country node? The query can also grow bigger: I may have to add matching against multiple skills, groups, projects etc.

Edit 4:

I modified the query slightly and it seems like this did the trick. I now match all the needed skills, company, groups, country etc. first, then use those later in the query. In the profiler this reduced the database hits from 700k to around 188. It is a slightly different query from my original one (different labeled nodes etc.), but it solves the same problem. I guess this can be further improved by matching on the node with the fewest relationships first, to start with a reduced number of nodes. I'll do some more testing later!

MATCH (company:Company { name: 'relinkgroup' }) 
WITH company 
MATCH (skill:Skill { name: 'java' }) 
WITH company, skill 
MATCH (skill2:Skill { name: 'ajax' }) 
WITH company, skill, skill2 
MATCH (country:Country { name: 'canada' }) 
WITH company, skill, skill2, country 
MATCH (company) <- [:WORKED_AT] - (person:Person) 
, (person) - [:HAS_SKILL] -> (skill) 
, (person) - [:HAS_SKILL] -> (skill2) 
, (person) - [:LIVES_IN] -> (country) 
RETURN DISTINCT ID(person) as id 
LIMIT 50

Christophe Willemsen · Accepted Answer

For the first line of your query, the execution has to look for all possible paths between the country and person. By limiting your initial match (thus defining a more accurate starting point for the traversal), you'll gain some performance.

So instead of

MATCH (country:Country { name: 'norway' }) <- [:LIVES_IN] - (person:Person)

Try doing it in two steps:

MATCH (country:Country { name: 'norway' })
WITH country
MATCH (country)<-[:LIVES_IN]-(person:Person)
WITH person

As an example, I'll use the simple movie app in the Neo4j console: http://console.neo4j.org/

Running a query equivalent to yours to find people who know Cypher:

MATCH (n:Crew)-[r:KNOWS]-(m) WHERE n.name='Cypher' RETURN n, m

The execution plan will be:

Execution Plan
ColumnFilter
  |
  +Filter
    |
    +TraversalMatcher

+------------------+------+--------+-------------+----------------------------------------+
|         Operator | Rows | DbHits | Identifiers |                                  Other |
+------------------+------+--------+-------------+----------------------------------------+
|     ColumnFilter |    2 |      0 |             |                      keep columns n, m |
|           Filter |    2 |     14 |             | Property(n,name(0)) == {  AUTOSTRING0} |
| TraversalMatcher |    7 |     16 |             |                                m, r, m |
+------------------+------+--------+-------------+----------------------------------------+

Total database accesses: 30

Defining an accurate starting point instead:

MATCH (n:Crew) WHERE n.name='Cypher' WITH n MATCH (n)-[:KNOWS]-(m) RETURN n,m

results in the following execution plan:

Execution Plan
ColumnFilter
  |
  +SimplePatternMatcher
    |
    +Filter
      |
      +NodeByLabel

+----------------------+------+--------+-------------------+----------------------------------------+
|             Operator | Rows | DbHits |       Identifiers |                                  Other |
+----------------------+------+--------+-------------------+----------------------------------------+
|         ColumnFilter |    2 |      0 |                   |                      keep columns n, m |
| SimplePatternMatcher |    2 |      0 | m, n,   UNNAMED53 |                                        |
|               Filter |    1 |      8 |                   | Property(n,name(0)) == {  AUTOSTRING0} |
|          NodeByLabel |    4 |      5 |              n, n |                                  :Crew |
+----------------------+------+--------+-------------------+----------------------------------------+

Total database accesses: 13

As you can see, the first method uses the traversal matcher, which gets much more expensive as the number of nodes grows, because you're doing a global match on the graph.

The second uses an explicit starting point, found through the label index.
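
To compare the two forms on your own data, one option is to prefix each variant with PROFILE, which runs the query and reports rows and database hits per operator. The sketch below reuses the labels and property from the question; the operator names in the output depend on the Neo4j version, so they may not match the plans shown above. Run each statement separately.

```cypher
// Variant 1: single pattern, the planner chooses where to start
PROFILE
MATCH (country:Country { name: 'norway' })<-[:LIVES_IN]-(person:Person)
RETURN count(person);

// Variant 2: look up the country first, then expand from it
PROFILE
MATCH (country:Country { name: 'norway' })
WITH country
MATCH (country)<-[:LIVES_IN]-(person:Person)
RETURN count(person);
```

Comparing the total database hits of the two runs should show whether the explicit starting point helps for your data.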

EDIT

For the skills part, I would do something like this. If you have some test data to provide, it would be helpful for testing:

MATCH (country:Country { name: 'norway' })
WITH country
MATCH (country)<-[:LIVES_IN]-(person:Person)-[:HAS_SKILL]->(skill:Skill)
WHERE skill.name = 'java'
WITH person
OPTIONAL MATCH (person)-[:HAS_SKILL]->(skillb:Skill) WHERE skillb.name = 'javascript'
WITH person, skillb

There is no need for global lookups: since the query has already found the persons, it just follows the HAS_SKILL relationships and filters on the skill.name value.
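
The fragment above stops at WITH person, skillb. A possible completion, reusing the ordering, LIMIT and RETURN from the original question, could look like the following. This is an untested sketch, and how the pieces are joined is an assumption rather than part of the original answer.

```cypher
MATCH (country:Country { name: 'norway' })
WITH country
MATCH (country)<-[:LIVES_IN]-(person:Person)-[:HAS_SKILL]->(skill:Skill)
WHERE skill.name = 'java'
WITH person
// The WHERE belongs to the OPTIONAL MATCH, so persons without
// 'javascript' are kept with skillb = null
OPTIONAL MATCH (person)-[:HAS_SKILL]->(skillb:Skill)
WHERE skillb.name = 'javascript'
// 1 if the person also has the optional skill, 0 otherwise
WITH person, CASE WHEN skillb IS NOT NULL THEN 1 ELSE 0 END AS matches
ORDER BY matches DESC
LIMIT 50
RETURN collect(id(person)) AS personIDs
```

Persons with the optional skill sort first because their matches value is 1.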

Edit 2:

Concerning your last edit, maybe this last part of the query:

MATCH (company) <- [:WORKED_AT] - (person:Person) 
, (person) - [:HAS_SKILL] -> (skill) 
, (person) - [:HAS_SKILL] -> (skill2) 
, (person) - [:LIVES_IN] -> (country)

could be better written as:

MATCH (person:Person)-[:WORKED_AT]->(company)
WHERE (person)-[:HAS_SKILL]->(skill)
AND (person)-[:HAS_SKILL]->(skill2)
AND (person)-[:LIVES_IN]->(country)
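
Put together with the lookups from your Edit 4, the whole query might look like the sketch below. It only combines the two pieces already shown (node names and values are taken from the question) and has not been profiled against real data, so treat the expected benefit as an assumption to verify.

```cypher
// Look up the anchor nodes first, as in Edit 4
MATCH (company:Company { name: 'relinkgroup' })
WITH company
MATCH (skill:Skill { name: 'java' })
WITH company, skill
MATCH (skill2:Skill { name: 'ajax' })
WITH company, skill, skill2
MATCH (country:Country { name: 'canada' })
WITH company, skill, skill2, country
// Expand from the company only, then check the other
// relationships for each candidate person
MATCH (person:Person)-[:WORKED_AT]->(company)
WHERE (person)-[:HAS_SKILL]->(skill)
  AND (person)-[:HAS_SKILL]->(skill2)
  AND (person)-[:LIVES_IN]->(country)
RETURN DISTINCT id(person) AS id
LIMIT 50
```

The idea is that only the WORKED_AT pattern produces rows, while the patterns in WHERE act as filters on those rows instead of generating additional matches.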
