ttz

Reputation: 79

MySQL Replication: Question about a fallback-system

I want to set up a complete server (Apache, MySQL 5.7) as a fallback for a production server. File-level synchronization using rsync and a cron job is already in place.

MySQL replication is currently the problem; more precisely, choosing the right replication method.

Multi-primary Group Replication seemed to be the most suitable method so far. In case of a longer production outage, it is possible to switch to the fallback server quickly via a DNS change, and writes to the database are possible immediately without any adjustments.

So far so good. But if the fallback server fails, it goes into UNREACHABLE status and the production server switches to read-only, because its group no longer has quorum. This is of course a no-go. I thought it might be possible to handle this with different replication variables: if the fallback server stays UNREACHABLE for a certain time (~5 minutes), the production server should stop Group Replication and start a new group. This has to happen automatically to keep the read-only time relatively short. When the fallback server is back online, it would be added to the newly started group manually. But if I read the various forum posts and the documentation correctly, it is not possible that way. And running Group Replication with only two nodes is the wrong decision anyway.
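Group Replication only lets a group keep accepting writes while a strict majority of its members can reach each other. A minimal sketch of that majority arithmetic (plain Python; the function names are hypothetical and are not MySQL settings) shows why a two-member group tolerates zero failures:

```python
def has_quorum(group_size: int, reachable: int) -> bool:
    """True if `reachable` members (counting this one) are a strict majority of `group_size`."""
    return reachable > group_size // 2


def tolerated_failures(group_size: int) -> int:
    """How many members can be lost while the rest still form a majority."""
    return group_size - (group_size // 2 + 1)


# Two-member group: with the fallback unreachable, production only sees itself.
print(has_quorum(2, 1))  # False -> writes are blocked
# Three-member group: the two survivors still form a majority.
print(has_quorum(3, 2))  # True
for n in (2, 3, 4, 5):
    print(n, "members tolerate", tolerated_failures(n), "failure(s)")
```

Note that a fourth member buys nothing over three; odd group sizes are the useful ones.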

https://forums.mysql.com/read.php?177,657333,657343#msg-657343

Is master-slave replication the only option worth considering for such a fallback system? https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-switch.html

Or does Group Replication offer some way to react appropriately to the quorum problem that I have overlooked so far?

Many thanks and best regards

Upvotes: 2

Views: 664

Answers (2)

Ali Momeni

Reputation: 490

I recommend using a Galera MySQL cluster with HAProxy as a load balancer and automatic failover solution. We have used it in production for a long time now and never had serious problems. The most important thing to consider is monitoring the replication sync status between nodes. Also, make sure your storage engine is InnoDB, because Galera doesn't work with MyISAM.

Check this link on how to set it up: https://medium.com/platformer-blog/highly-available-mysql-with-galera-and-haproxy-e9b55b839fe0
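Monitoring the sync status usually comes down to a few `wsrep_%` status variables. A minimal sketch, assuming the rows of `SHOW GLOBAL STATUS LIKE 'wsrep_%'` have already been fetched into a `{Variable_name: Value}` dict (the function name and the `expected_size` parameter are hypothetical; how you fetch the rows and alert on the result depends on your monitoring stack):

```python
from typing import Dict, List


def galera_problems(status: Dict[str, str], expected_size: int) -> List[str]:
    """Return health warnings for one Galera node (empty list = no warnings).

    `status` is assumed to map Variable_name -> Value from
    SHOW GLOBAL STATUS LIKE 'wsrep_%'; MySQL returns every value as a string.
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        # Anything but Primary means this node is not in the quorum component.
        problems.append("not in the Primary (quorum) component")
    state = status.get("wsrep_local_state_comment")
    if state != "Synced":
        # e.g. 'Joining' or 'Donor/Desynced': the node is not fully caught up.
        problems.append(f"local state is {state!r}, not 'Synced'")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    size = int(status.get("wsrep_cluster_size", "0"))
    if size < expected_size:
        # Redundancy alert: the cluster has lost at least one node.
        problems.append(f"cluster size {size} is below the expected {expected_size}")
    return problems


# Example values only, not real server output.
sample = {"wsrep_cluster_status": "Primary", "wsrep_local_state_comment": "Synced",
          "wsrep_ready": "ON", "wsrep_cluster_size": "2"}
print(galera_problems(sample, expected_size=3))  # ['cluster size 2 is below the expected 3']
```

The first three checks decide whether the load balancer should send traffic to the node; the size check is an alert that the cluster has lost redundancy.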

But in these kinds of situations, the main problem is not the failover mechanism, because there are many out-of-the-box solutions for that. Rather, you have to check your read/write ratio and your transactional services, and make sure replication delays won't affect them. Sometimes vertically scaled solutions with master-slave replication are more suitable for transaction-sensitive financial systems; it really depends on the service you're providing.
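One way to keep replication delay away from transactional services is to route reads by freshness. A sketch under simple assumptions (the function and its `max_lag_s` threshold are hypothetical; the lag value is assumed to come from the `Seconds_Behind_Master` column of `SHOW SLAVE STATUS` on a MySQL 5.7 replica, which is only a coarse estimate):

```python
from typing import Optional


def read_target(seconds_behind_master: Optional[int], max_lag_s: int,
                needs_latest_data: bool) -> str:
    """Pick where a read should go in a primary/replica setup.

    Seconds_Behind_Master is NULL (None here) when replication is not
    running, which is treated as "lag unknown" and therefore unsafe.
    """
    if needs_latest_data:
        # Transaction-sensitive reads (e.g. balance checks) must see the latest writes.
        return "primary"
    if seconds_behind_master is None or seconds_behind_master > max_lag_s:
        return "primary"
    return "replica"


print(read_target(2, max_lag_s=5, needs_latest_data=False))     # replica
print(read_target(None, max_lag_s=5, needs_latest_data=False))  # primary
print(read_target(0, max_lag_s=5, needs_latest_data=True))      # primary
```

The more reads fall into the `needs_latest_data` bucket, the less a replica helps, which is the read/write-ratio question above.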

Upvotes: 1

Rick James

Reputation: 142528

Short Answer: You must have [at least] 3 nodes.

Long Answer:

Split brain with only two nodes:

  • Write only to the surviving node, but only if you can conclude that it is the only surviving node, else...
  • The network died and both Primaries are accepting writes. This leads to them disagreeing with each other, and you may have no clean way to repair the mess.
  • Go into readonly mode with surviving node. (The only safe and sane approach.)

The problem is that the automated system cannot tell the difference between a dead Primary and a dead network.

So... You must have 3 nodes to safely avoid "split-brain" and have a good chance of an automated failover. This also implies that no two nodes should be in the same tornado path, flood range, volcano path, earthquake fault, etc.
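The split-brain argument can be checked mechanically. The sketch below (plain Python; the node names are placeholders) enumerates every way a network failure can cut the nodes into two isolated sides and reports which side still holds a strict majority, i.e. which side may safely keep accepting writes. With two nodes neither side ever does, so read-only is the only safe outcome; with three nodes exactly one side of any two-way split does, so the other side can stop writing automatically.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Split = Tuple[Tuple[str, ...], Tuple[str, ...], List[Tuple[str, ...]]]


def partition_outcomes(nodes: Sequence[str]) -> List[Split]:
    """Enumerate every two-way network split and the side(s) keeping a strict majority."""
    n = len(nodes)
    first, rest = nodes[0], list(nodes[1:])
    outcomes = []
    # Keep the first node on side A so each split is listed only once;
    # k stops at n - 2 so side B is never empty.
    for k in range(0, n - 1):
        for others in combinations(rest, k):
            side_a = (first,) + others
            side_b = tuple(x for x in rest if x not in others)
            winners = [s for s in (side_a, side_b) if len(s) > n // 2]
            outcomes.append((side_a, side_b, winners))
    return outcomes


for cluster in (["prod", "fallback"], ["prod", "fallback", "third"]):
    for side_a, side_b, winners in partition_outcomes(cluster):
        verdict = winners[0] if winners else "none (both sides must stop writing)"
        print(side_a, "|", side_b, "-> writable side:", verdict)
```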

You picked Group Replication (InnoDB Cluster). That is an excellent offering from MySQL. Galera with MariaDB is an equally good offering -- there are a lot of differences in the details, but it boils down to needing 3, preferably dispersed, nodes.

DNS changes take some time to take effect because of the TTL. A proxy server may help with this.
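A back-of-the-envelope comparison of the two switching approaches, as a sketch. The values are assumptions chosen for illustration (a ~5 minute decision time as in the question, a 1 hour TTL, and a proxy that checks every 2 s and marks a backend down after 3 failures; HAProxy, for example, exposes such settings as `inter` and `fall`), and it assumes DNS resolvers honour the TTL:

```python
def dns_switch_window_s(detection_s: int, dns_ttl_s: int) -> int:
    """Worst case until all clients reach the new server after a DNS change.

    A client that resolved the name just before the record changed keeps
    the old address for one full TTL (assuming resolvers honour it).
    """
    return detection_s + dns_ttl_s


def proxy_switch_window_s(check_interval_s: int, failed_checks: int) -> int:
    """Time for a health-checking proxy to stop routing to a dead backend.

    Clients keep talking to the proxy's own address, so no DNS cache is involved.
    """
    return check_interval_s * failed_checks


print(dns_switch_window_s(300, 3600))  # 3900
print(proxy_switch_window_s(2, 3))     # 6
```

Lowering the TTL shrinks the DNS window, but the proxy still avoids waiting on caches that are outside your control.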

Galera can run in a "Primary + Replicas" mode, but it also allows you to run with all nodes being read-write. This leads to a slightly different set of steps a client must take to stop writing to one node and start writing to another. There are proxies to help with this.

Failback

Are you trying to always use a certain Primary except when it is down? Or can you accept letting any node be the 'current' Primary?

I think of "fallback" as simply a "failover" that goes back to the original Primary. That implies a second outage (possibly briefer). However, I understand geographic considerations. You may want your main Primary to be 'near' most of your customers.
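The two policies in the question above can be sketched as a writer-selection function for a proxy or client. This is a hypothetical helper, not part of any proxy's API; the node names are placeholders, and in practice "healthy" should also mean "caught up on replication" and be combined with the quorum rule, otherwise a preference list alone can still produce split-brain.

```python
from typing import Callable, Optional, Sequence


def pick_writer(preferred_order: Sequence[str],
                is_healthy: Callable[[str], bool],
                current: Optional[str] = None,
                sticky: bool = False) -> Optional[str]:
    """Choose which node should receive writes.

    sticky=False: always return the most-preferred healthy node, so writes move
    back to the main Primary as soon as it is healthy again (automatic
    failback, which costs a second switchover).
    sticky=True: keep the current writer while it stays healthy, so any node can
    remain the 'current' Primary and failback becomes a deliberate step.
    """
    if sticky and current is not None and is_healthy(current):
        return current
    for node in preferred_order:
        if is_healthy(node):
            return node
    return None  # no healthy node: stop writing rather than guess


up = {"main": True, "dr": True}
print(pick_writer(["main", "dr"], lambda n: up.get(n, False), current="dr"))               # main
print(pick_writer(["main", "dr"], lambda n: up.get(n, False), current="dr", sticky=True))  # dr
```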

Upvotes: 1
