johna

Reputation: 21

Reliability of MySQL master-slave replication

I have an application that requires a master catalogue of about 30 tables to be copied out to many (100+) slave copies of the application. Slaves may be in their own DB instance, or there may be multiple slaves in a single DB instance. Any changes to the master catalogue need to be copied out to the slaves within a reasonable time, around 5 minutes. Our infrastructure is all AWS EC2 and we use MySQL. Master and slaves will all reside within a single AWS region.

I had planned to use master-slave replication, but I see reports of MySQL replication sometimes being unreliable, and I am not sure whether this is due to failings in the particular implementations or in MySQL itself. We need a highly automated and reliable system, and we may have to develop monitoring scripts that let each slave continuously check its catalogue against the master.
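
For example, I imagine something roughly like this running from cron on each slave (just a sketch; the 300-second threshold simply reflects our ~5 minute requirement):

    -- Check replication health on a slave; alert if either replication thread
    -- has stopped or the slave is more than ~5 minutes behind the master.
    SHOW SLAVE STATUS\G
    -- Fields to inspect in the output:
    --   Slave_IO_Running / Slave_SQL_Running   -- both should be 'Yes'
    --   Seconds_Behind_Master                  -- alert if NULL or > 300
    --   Last_IO_Error / Last_SQL_Error         -- alert if non-empty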

Any observations?

Upvotes: 2

Views: 3091

Answers (1)

Bill Karwin

Reputation: 562230

When I was taking dance lessons before my wedding, the instructor said, "You don't have to do every step perfectly, you just have to learn to recover gracefully when missteps happen. If you can do that quickly, with a smile on your face, no one will notice."

If you have 100+ replicas, expect that you will be reinitializing replicas frequently, probably at least one or two every day. This is normal.

All software has bugs. Expecting anything different is, frankly, naive. Don't expect software to be flawless and continue operating 24/7 indefinitely without errors, because you will be disappointed. You should not seek a perfect solution, you should think like a dancer and recover gracefully.

MySQL replication is reasonably stable, and no less so than other solutions. But there are a variety of failures that can happen, without it being MySQL's fault.

  • Binlogs can develop corrupted packets in transit due to network glitches. MySQL 5.6 introduced binlog checksums to detect this (see the configuration sketch after this list).

  • The master instance can crash and fail to write an event to the binlog. Setting sync_binlog=1 helps ensure every transaction is written to the binlog at commit time, though it adds I/O overhead to each commit.

  • Replica data can fall out of sync because of non-deterministic SQL statements, packet corruption, log corruption on disk, or because someone changes data directly on a replica. Percona's pt-table-checksum can detect this, and pt-table-sync can correct the errors (a usage sketch follows this list). Using binlog_format=ROW reduces the chance of non-deterministic changes. Setting the replicas read-only helps too, and don't give users the SUPER privilege, which bypasses read-only.

  • Resources can run out. For example, you could fill up the disk on the master or the replica.

  • Replicas can fall behind if they can't keep up with the changes on the master. Make sure your replica instances are not under-powered. Use binlog_format=ROW. Write fewer changes to an individual MySQL master. MySQL 5.6 introduced multi-threaded replicas (slave_parallel_workers in the configuration sketch below), but so far I've seen cases where this is still a bit buggy, so test carefully.

  • Replicas can be offline for an extended time, and when they come back online, some of the master's binlogs may have been expired, so the replica can't replay a continuous stream of events from where it left off. In that case, you should trash the replica and reinitialize it (see the reinitialization sketch after this list).

  • Bugs happen in any software project, and MySQL's replication has had its share. Keep reading MySQL's release notes, and be prepared to upgrade to take advantage of bug fixes.
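
Here is roughly what those configuration settings look like. The values are illustrative assumptions to adapt, not tuned recommendations, and you should persist them in my.cnf as well so they survive a restart:

    -- On the master:
    SET GLOBAL binlog_format    = 'ROW';    -- row-based events avoid non-deterministic statements
    SET GLOBAL binlog_checksum  = 'CRC32';  -- detect binlog corruption (MySQL 5.6+)
    SET GLOBAL sync_binlog      = 1;        -- fsync the binlog at commit (adds I/O overhead)
    SET GLOBAL expire_logs_days = 7;        -- keep enough binlog history for offline replicas

    -- On each replica:
    SET GLOBAL read_only = ON;              -- blocks writes from users without SUPER
    SET GLOBAL slave_parallel_workers = 4;  -- multi-threaded apply (5.6+); test carefully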
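
For the checksum-and-repair step, the Percona Toolkit workflow looks roughly like this; the host names, the user, and the catalogue database name are placeholders, not values from your setup:

    # Run on the master: per-chunk checksums flow through replication into the
    # percona.checksums table on every replica, where they can be compared.
    pt-table-checksum --databases=catalogue h=master.example.com,u=checksum_user --ask-pass

    # Print (without executing) the statements that would bring one replica
    # back into sync with its master.
    pt-table-sync --print --sync-to-master h=replica01.example.com,u=checksum_user --ask-pass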
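
And reinitializing a broken replica can be as simple as reloading a fresh dump from the master. Again the host names and user are placeholders; --single-transaction assumes InnoDB tables, and the replica keeps its existing master host and credentials because --master-data only embeds the binlog coordinates:

    # 1. Dump the catalogue from the master; --master-data=1 embeds a
    #    CHANGE MASTER TO statement with the matching binlog coordinates.
    mysqldump --single-transaction --master-data=1 --databases catalogue \
        -h master.example.com -u repl_admin -p > catalogue.sql

    # 2. On the replica: stop replication, reload the data, then resume from
    #    the coordinates embedded in the dump.
    mysql -h replica01.example.com -u repl_admin -p -e "STOP SLAVE"
    mysql -h replica01.example.com -u repl_admin -p < catalogue.sql
    mysql -h replica01.example.com -u repl_admin -p -e "START SLAVE"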

Managing a big collection of database servers in continuous operation takes a significant amount of full-time work, no matter what brand of database you use. But data has become the lifeblood of most businesses, so it's necessary to manage this resource. MySQL is no better and no worse than any other brand of database, and if anyone tells you something different, they're selling something.

P.S.: I'd like to hear why you think you need 100+ replicas in a single AWS region, because that is probably overkill by an order of magnitude for any goal of high availability or scaling.

Upvotes: 7
