setting up Neo4j replication on two instances

Question

I am planning to configure some sort of two-node replication for Neo4j, similar to MySQL replication. Since I am a little constrained on resources, I don't want to pay for more than two cloud compute instances. I am also happy with just one real-time or near-real-time copy of the Neo4j database. The approaches I can think of are:

Configure HA on the two compute nodes with the help of an arbiter instance. Set up one Neo4j instance (master) on the first node, and a second Neo4j instance (slave) plus an arbiter instance (only for arbitration, no data logging) on the second node.

OR

Set up a cron job for online backups using the neo4j-backup tool, with incremental backups every hour or so. I am not sure how much load this may put on the prod server; I am planning to test that.

I am more inclined toward the first approach, since I get a more real-time copy of the database (I also get HA/load balancing with instant failover, but that is not a priority right now).

Please let me know

which of the two approaches is better,
if there is another way to achieve the same, or
if either of the above approaches is unsuitable or has flaws.

I am a little new to Neo4j HA so please pardon me for my ignorance. Thanks !

FylmTM · Accepted Answer

You have already listed the available solutions.

TL;DR: I prefer the first option.

Cluster

In general, the recommended layout is 3 nodes (1 master + 2 slaves). But your layout, 2 nodes running 3 instances (1 master + 1 slave + 1 arbiter), is viable too, especially if one server can handle your workload.
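To see why the arbiter matters in a two-machine layout, here is a minimal Python sketch of the majority arithmetic. It assumes the cluster needs a strict majority of its members to elect or keep a master; has_quorum is a made-up helper for illustration, not a Neo4j API.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """Return True when a strict majority of cluster members is reachable."""
    return reachable_members > cluster_size // 2


# Layout from the question: master + slave + arbiter = 3 voting members.
CLUSTER_SIZE = 3

# Machine 1 (master) goes down: slave and arbiter together are still a majority.
print(has_quorum(2, CLUSTER_SIZE))  # True

# Machine 2 (slave + arbiter) goes down: the master alone is not a majority.
print(has_quorum(1, CLUSTER_SIZE))  # False

# Without an arbiter, a two-member cluster has no majority after any single failure.
print(has_quorum(1, 2))  # False
```

Under that assumption, the arbiter is what lets the slave take over when the first machine fails, while losing the second machine (which hosts both the slave and the arbiter) would leave the master without a majority.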

Good things:

Almost "real-time" replica.
You can use the resources of both machines to handle a bigger workload.
Better availability.

Notes:

If you have a 10 MB/sec write load on the master, then the same load will be applied to the slave node. This shouldn't affect reads from the slave at all (unless the write load is REALLY huge).
Maintenance costs are higher than for a single-instance installation. You should plan how to handle cluster upgrades, configuration updates and plugin updates.
Branched data. In a clustered environment it is possible to end up in a "split-brain" scenario, where 2 nodes have different data and a decision must be made about which data to keep. Neo4j handles such cases quite well, but keep in mind that a small data loss can occur in VERY RARE scenarios.

Backup

Good things:

Simple. Just take backups of the database.
Consistency check. When a backup is made, the tool runs a consistency check to verify that the database is not damaged. There is no way for the backup to corrupt the live database. If there are any issues, you will be notified via the logs of the backup utility. See below for details on how a backup is performed.
Database. A Neo4j backup is a fully functional database. You can spin up a server that points to the backup database and do everything you want.
Incremental backups. You can take incremental backups as often as you want.

Notes:

Neo4j scales vertically very well (depending on the size of the database). It can handle a huge load on a single instance (we had up to 3k requests/second on a medium machine). So you can get one bigger machine for the Neo4j server and another, smaller (cheaper) one for backups.

How backup is performed?

One thing to keep in mind: the live database stays fully operational. The backup utility does not stop or block any actions.

When a transaction is committed in the database, all changes are appended to the transaction log.

Depending on what already exists, the backup then works as follows:
When there is no previous backup: copy the whole storage.
When there is a previous backup AND the transaction logs are available: copy the new transaction logs and replay them onto the backup storage.
When there is a previous backup AND the transaction logs are NOT available: discard the existing backup and copy the whole storage again.
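The three cases above can be sketched as a small decision function. This is only an illustration of the logic as described here, not the backup tool's actual code; BackupMode and choose_backup_mode are hypothetical names.

```python
from enum import Enum


class BackupMode(Enum):
    FULL = "copy the whole storage"
    INCREMENTAL = "copy new transaction logs and replay them"


def choose_backup_mode(previous_backup_exists: bool,
                       tx_logs_available: bool) -> BackupMode:
    """Pick the backup strategy according to the three cases listed above."""
    if not previous_backup_exists:
        # Nothing to build on: a full copy is the only option.
        return BackupMode.FULL
    if tx_logs_available:
        # The gap since the last backup can be replayed from the logs.
        return BackupMode.INCREMENTAL
    # Logs covering the gap are gone: discard the old backup, copy everything.
    return BackupMode.FULL


print(choose_backup_mode(False, True).name)
print(choose_backup_mode(True, True).name)
print(choose_backup_mode(True, False).name)
```

The point to notice is that a missing transaction log silently turns an incremental backup into a full copy, which is much heavier on the production server.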

Why might transaction logs be unavailable? Your configuration may say to keep only the latest transaction logs (e.g. the last hour), or not to keep any at all.
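As a rough way to reason about this for an hourly cron job, the sketch below compares the time since the last backup with the log retention window. It assumes a purely time-based retention policy, which is a simplification; logs_cover_gap is a hypothetical helper, not a Neo4j setting or API.

```python
from datetime import timedelta


def logs_cover_gap(time_since_last_backup: timedelta,
                   log_retention: timedelta) -> bool:
    """Return True if the retained logs still reach back to the last backup."""
    return time_since_last_backup <= log_retention


# Hourly backups with logs kept for a day: incremental backups stay possible.
print(logs_cover_gap(timedelta(hours=1), timedelta(days=1)))

# One missed run with only 1 hour of logs kept: the next backup is a full copy.
print(logs_cover_gap(timedelta(hours=2), timedelta(hours=1)))
```

In practice it seems sensible to keep logs for noticeably longer than the backup interval, so that a single failed cron run does not force a full copy.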

Relevant settings:

Other

Either way, you should consider making backups even in a clustered environment. Anything can fail, at any moment.

In general - everything depends on your load and database size.

If your database is small enough to fit fully in memory and one machine is enough to handle all the load, then one Neo4j instance will be enough. Just take backups.

If you want better scalability/availability and a real-time working replica, then a cluster setup is the best choice.
