Dee

Reputation: 27

Neo4j Cypher performance on simple match

I have a very simple Cypher query that performs poorly. I have approx. 2 million users and 60 book categories, with around 28 million relationships from users to categories. When I run this query:

MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);

It returns 8.5k rows in 2 to 2.5 minutes (on the first run).

And when I run this query:

MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;

It returns 55k rows in 3 to 6 minutes (on the first run).

I already have indexes on User id and email, but I still don't think this performance is acceptable. Any idea how I can improve it?

Upvotes: 0

Views: 52

Answers (2)

Michael Hunger

Reputation: 41676

Can you explain your model a bit?

Where are the books and the "reading" event in it?

Afaik, all you want to know is which book categories have been read recently (in the last month)?

You could create a second type of relationship, RECENTLY_READ, which expires (is deleted by a batch job) once it is older than 30 days. That takes two simple Cypher statements, one to create and one to delete those relationships:

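// Create or refresh RECENTLY_READ for every READ from the last 30 days.
WITH (1000*60*60*24*30) AS month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
MERGE (a)-[rr:RECENTLY_READ]->(b)
// MERGE cannot take a WHERE clause, so filter in a following WITH.
WITH rr, read
WHERE coalesce(rr.timestamp, 0) < read.timestamp
SET rr.timestamp = read.timestamp;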

WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
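
Once those relationships are kept up to date, the two queries from the question could be answered without any timestamp filter. This is only a sketch, and it assumes the batch job above runs often enough that every RECENTLY_READ relationship is at most 30 days old:

```cypher
// Recently read categories: no timestamp comparison needed.
MATCH (:User)-[:RECENTLY_READ]->(bc:BookCategory)
RETURN DISTINCT bc.id;

// Users with a recent read, plus the time of that read.
MATCH (u:User)-[rr:RECENTLY_READ]->(:BookCategory)
RETURN u.id, u.email, rr.timestamp;
```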

There is another way to achieve exactly what you want here, but unfortunately it's not possible in Cypher.

With a relationship index on the timestamp property of your READ relationships, you can run a Lucene NumericRangeQuery through Neo4j's Java API.

But I wouldn't really recommend going down this route.

Upvotes: 0

FylmTM

Reputation: 1997

First of all, you can profile your query to find out what happens under the hood.

Currently it looks like the query scans all nodes in the database to complete.
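
For example, prefixing the first query from the question with PROFILE runs it and prints the execution plan, including the number of rows and database hits per operator:

```cypher
// Same query as in the question, with PROFILE in front of it.
PROFILE
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN DISTINCT bc.id;
```

A plan that starts with a label scan followed by an expand and a filter would confirm that every READ relationship is being checked.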

Reasons:

  • Neo4j supports indexes only for the '=' operation (or 'IN')
  • To complete the query, it traverses all nodes one by one, checking whether each has a valid timestamp

There is no straightforward way to deal with this problem. You should look into creating a proper graph structure to handle time-specific queries more efficiently. There are several ways to represent time in graph databases.

You can take a look at the graphaware/neo4j-timetree library.
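
One possible structure, as a rough sketch only: model each calendar day as its own node, so that a time window becomes a set of exact-match lookups that an '=' / 'IN' index can serve. The :Day label, its date property, the READ_ON relationship, and the literal values below are all hypothetical and not part of the question's model; an index on :Day(date) is assumed.

```cypher
// Assumed schema, not from the question: one :Day node per calendar day
// and a hypothetical READ_ON relationship from a category to each day
// on which it was read. The id and dates are illustrative values.

// When a read is recorded, also link the category to that day.
MATCH (bc:BookCategory {id: 42})
MERGE (d:Day {date: '2015-03-01'})
MERGE (bc)-[:READ_ON]->(d);

// Categories read on the given days, found via exact-match lookups.
MATCH (d:Day)<-[:READ_ON]-(bc:BookCategory)
WHERE d.date IN ['2015-03-01', '2015-03-02', '2015-03-03']
RETURN DISTINCT bc.id;
```

For a 30-day window the application would generate the list of 30 dates itself. With only 60 categories, this touches at most 30 × 60 relationships instead of 28 million.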

Upvotes: 1
