gauthama

Reputation: 51

Neo4j Large Scale Aggregation - sub-second time possible?

Our team is currently evaluating Neo4j, and graph databases as a whole, as a candidate for our backend solution.

The upsides - the flexible data model, fast traversals in a native graph store - are all very applicable to our problem space.

However, we also have a need to perform large-scale aggregations on our datasets. I'm testing a very simple use case with a simple data model: (s:Specimen)-[:DONOR]->(d:Donor)

A Specimen has an edge relating it to a Donor.

The dataset I loaded has ~6 million Specimens, and a few hundred Donors. The aggregation query I want to perform is simple:

MATCH (s: Specimen)-[e: DONOR]->(d: Donor) 
WITH d.sex AS sex, COUNT(s.id) AS count 
RETURN count, sex

The query is very slow: it takes ~9 seconds to return a result. We need sub-second response times for this solution to work.

We are running Neo4j on an EC2 instance with 32 vCPUs and 256GB of memory, so compute power shouldn't be a blocker here. The database itself is only 15GB.

We also have indexes on both the Specimen and Donor nodes, as well as an index on the Donor.sex property.

Any suggestions on improving the query times? Or are Graph Databases simply not cut out for such large-scale aggregations?

Upvotes: 0

Views: 264

Answers (1)

Lju

Reputation: 590

You will more than likely need to refactor your graph model. For example, you may want to investigate whether multiple labels (e.g. something like Specimen:Male / Specimen:Female) are appropriate for your data, as a label acts as a pre-filter before scanning the database.
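A minimal sketch of that refactor, assuming the labels Male/Female and the property values 'male'/'female' (both are illustrative, not from your schema): copy the donor's sex onto each Specimen as a label at load time, so the aggregation no longer has to expand ~6 million DONOR relationships.

// One-time refactor: tag each Specimen with its donor's sex.
MATCH (s:Specimen)-[:DONOR]->(d:Donor)
WHERE d.sex = 'male'
SET s:Male;

MATCH (s:Specimen)-[:DONOR]->(d:Donor)
WHERE d.sex = 'female'
SET s:Female;

// The aggregation then reduces to per-label node counts:
MATCH (s:Male) RETURN count(s);
MATCH (s:Female) RETURN count(s);

A simple count(n) over a single label can typically be answered from Neo4j's internal count store rather than by scanning nodes, which is what makes this shape fast. The trade-off is denormalization: if a donor's sex can change, the labels must be kept in sync.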


Upvotes: 1
