Reputation: 315
I've run into a technical challenge around Neo4j usage that has had me stumped for a while. My organization uses Neo4j to model customer interaction patterns. The graph has grown to around 2 million nodes and 7 million edges, and every node and edge carries between 5 and 10 metadata properties. Every day we export data on all of our customers from Neo4j to a series of Python processes that perform business logic.
Our original method of data export was to use paginated Cypher queries to pull the data we needed. For each customer node, the Cypher queries had to collect many types of surrounding nodes and edges so that the business logic could be performed with the necessary context. Unfortunately, as the size and density of the data grew, these paginated queries became too slow to be practical.
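For illustration, a rough sketch of that kind of paginated pull with the official Neo4j Python driver (the label, relationship pattern, credentials, and page size are placeholders, not our exact schema):

```python
# Sketch of a paginated per-customer context pull; labels, relationship types,
# and credentials are placeholders rather than our real schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

PAGE_SIZE = 1000

QUERY = """
MATCH (c:Customer)
WITH c ORDER BY id(c) SKIP $skip LIMIT $limit
OPTIONAL MATCH (c)-[r]-(n)
RETURN c, collect({rel: r, node: n}) AS context
"""

def export_pages():
    skip = 0
    with driver.session() as session:
        while True:
            records = list(session.run(QUERY, skip=skip, limit=PAGE_SIZE))
            if not records:
                break
            for record in records:
                yield record["c"], record["context"]
            skip += PAGE_SIZE

for customer, context in export_pages():
    pass  # hand each customer's context off to the business-logic processes
```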
Our current approach uses a custom Neo4j procedure to iterate over nodes, collect the necessary surrounding nodes and edges, serialize the data, and place it on a Kafka queue for downstream consumption. This method worked for some time, but it is now taking long enough that it, too, is becoming impractical, especially considering that we expect the graph to grow by an order of magnitude.
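The downstream side is plain Kafka consumption in Python; a minimal consumer sketch (the topic name, broker address, and JSON payload shape are assumptions for illustration) looks like this:

```python
# Minimal consumer sketch; topic name, broker address, and the JSON payload
# structure are assumptions for illustration.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "customer-context",                       # hypothetical topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
    enable_auto_commit=True,
)

for message in consumer:
    payload = message.value                   # customer plus surrounding nodes/edges
    # ... run business logic on the customer's context ...
```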
I have tried the cypher-for-apache-spark and neo4j-spark-connector projects, neither of which has been able to provide the query and data-transfer speeds that we need.
We currently run on a single Neo4j instance with 32GB memory and 8 cores. Would a cluster help mitigate this issue?
Does anyone have any ideas or tips for how to perform this kind of data export? Any insight into the problem would be greatly appreciated!
Upvotes: 2
Views: 394
Reputation: 6318
Neo4j Enterprise supports clustering. You could use the Causal Clustering feature, launch as many read replicas as you need, and run the queries in parallel on the read replicas. See this link: https://neo4j.com/docs/operations-manual/current/clustering/setup-new-cluster/#causal-clustering-add-read-replica
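A rough sketch of fanning the export out across replicas from Python (the replica addresses, credentials, label, and the id-based partitioning are assumptions, not a prescribed setup):

```python
# Rough sketch: partition the customer nodes by id and read each partition
# from a different read replica in parallel. Replica addresses, credentials,
# labels, and the partitioning scheme are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

REPLICAS = ["bolt://replica1:7687", "bolt://replica2:7687", "bolt://replica3:7687"]
AUTH = ("neo4j", "password")

QUERY = """
MATCH (c:Customer)
WHERE id(c) % $partitions = $partition
OPTIONAL MATCH (c)-[r]-(n)
RETURN c, collect({rel: r, node: n}) AS context
"""

def read_partition(args):
    uri, partition, partitions = args
    driver = GraphDatabase.driver(uri, auth=AUTH)
    with driver.session() as session:
        rows = [(rec["c"], rec["context"])
                for rec in session.run(QUERY, partitions=partitions, partition=partition)]
    driver.close()
    return rows

tasks = [(uri, i, len(REPLICAS)) for i, uri in enumerate(REPLICAS)]
with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
    for rows in pool.map(read_partition, tasks):
        pass  # feed each partition's rows to the downstream processing
```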
Upvotes: 0
Reputation: 2033
As far as I remember, Neo4j doesn't support horizontal scaling, and all data is stored on a single node. To use Spark, you could try storing your graph across 2+ nodes and loading the parts of the dataset from those separate nodes to "simulate" parallelization. I don't know whether either of the connectors you mention supports this.
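A sketch of that idea with PySpark and the plain Python driver, assuming the graph has already been split across several independent Neo4j instances (the URIs, credentials, and query are placeholders):

```python
# Sketch of the "simulate parallelization" idea: the graph is split across
# several independent Neo4j instances and each Spark task loads its part with
# the plain Python driver. URIs, credentials, label, and query are assumptions.
from neo4j import GraphDatabase
from pyspark.sql import SparkSession

# One URI per Neo4j instance holding a partition of the graph (hypothetical hosts).
PART_URIS = ["bolt://neo4j-part1:7687", "bolt://neo4j-part2:7687"]
AUTH = ("neo4j", "password")

QUERY = """
MATCH (c:Customer)
OPTIONAL MATCH (c)-[r]-(n)
RETURN c AS customer, collect(properties(n)) AS neighbours
"""

def load_part(uri):
    # Runs on a Spark executor: pull everything stored on this instance.
    driver = GraphDatabase.driver(uri, auth=AUTH)
    with driver.session() as session:
        rows = [record.data() for record in session.run(QUERY)]
    driver.close()
    return rows

spark = SparkSession.builder.appName("neo4j-partitioned-export").getOrCreate()
exported = (spark.sparkContext
            .parallelize(PART_URIS, len(PART_URIS))
            .flatMap(load_part))
print(exported.count())
```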
But as mentioned in the comments on your question, maybe you could try an alternative approach; whether one makes sense also depends on how often your data changes and on how deep and broad your graph is.
Upvotes: 0