gauthama

Reputation: 51

Neo4j Large Scale Aggregation - sub-second time possible?

Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.

The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.

However, we also have a need to perform large-scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s:Specimen)-[:DONOR]->(d:Donor)

A Specimen has an edge relating it to a Donor.

The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:

MATCH (s: Specimen)-[e: DONOR]->(d: Donor) 
WITH d.sex AS sex, COUNT(s.id) AS count 
RETURN count, sex

The query is very slow: it takes ~9 seconds to return a result. We need sub-second response times for this solution to work.

We are running Neo4j on an EC2 instance with 32 vCPUs and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.

We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.

Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?

Upvotes: 0

Views: 264

Answers (1)

Lju

Reputation: 590

You will more than likely need to refactor your graph model. For example, you may want to investigate whether multiple labels (e.g. something like Specimen:Male / Specimen:Female) are appropriate for your data, as a label acts as a pre-filter before scanning the database.
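A minimal sketch of that refactor, assuming the labels Male/Female and the property values 'male'/'female' (both are illustrative, not from your schema): copy the donor's sex onto each Specimen as a label at load time, so the aggregation no longer has to expand ~6 million DONOR relationships.

// One-time refactor: tag each Specimen with its donor's sex.
MATCH (s:Specimen)-[:DONOR]->(d:Donor)
WHERE d.sex = 'male'
SET s:Male;

MATCH (s:Specimen)-[:DONOR]->(d:Donor)
WHERE d.sex = 'female'
SET s:Female;

// The aggregation then reduces to per-label node counts:
MATCH (s:Male) RETURN count(s);
MATCH (s:Female) RETURN count(s);

A simple count(n) over a single label can typically be answered from Neo4j's internal count store rather than by scanning nodes, which is what makes this shape fast. The trade-off is denormalization: if a donor's sex can change, the labels must be kept in sync.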


Upvotes: 1
