Reputation: 413
I'm trying to create a disaster recovery plan for Aurora MySQL that is cost-efficient, maintainable, and has little downtime.
I want two read/write databases in two different regions. They can be separate databases called primary-us-east-1 and backup-us-east-2. I also want bidirectional replication between primary-us-east-1 and backup-us-east-2. Only one database will be connected to at any given time, so write collisions are not a concern. If region us-east-1 goes down, all I have to do is trigger a DNS switch to point to us-east-2, since backup-us-east-2 is already up to date.
I've looked into Aurora Global Databases, but recovering from a region outage requires promoting a read replica in a secondary region to a master and then updating the DNS. I like that data replication across several regions takes zero work, but I don't like losing maintainability in the process: the newly created resources (clusters/replicas) won't be maintainable in CDK if they are created through a Lambda or by hand.
Is what I'm asking for possible? If yes, does anyone know of a replication solution so data can be copied between primary-us-east-1 and backup-us-east-2?
UPDATE 1:
A potential solution is to stand up the Aurora MySQL resources primary-us-east-1 and backup-us-east-2 using CDK and keep them in sync using AWS Database Migration Service for continuous replication. A Lambda would detect a region outage and then perform the DNS switch to point to backup-us-east-2. The only follow-up task would be bringing primary-us-east-1 back in sync with backup-us-east-2.
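For illustration, here is a rough sketch of what the DNS switch step might look like in Python. It assumes the app connects through a CNAME record in a Route 53 hosted zone; the record name `db.example.com`, the cluster endpoint hostname, and the helper names `build_dns_switch` and `apply_dns_switch` are all hypothetical. The outage detection itself is not shown.

```python
def build_dns_switch(record_name: str, target_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that points a CNAME at the target endpoint."""
    return {
        "Comment": f"Fail over {record_name} to {target_endpoint}",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


def apply_dns_switch(hosted_zone_id: str, change_batch: dict) -> dict:
    """Send the change to Route 53. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder above works without boto3

    route53 = boto3.client("route53")
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=change_batch,
    )


# Hypothetical names, for illustration only.
batch = build_dns_switch(
    "db.example.com",
    "backup-us-east-2.cluster-example.us-east-2.rds.amazonaws.com",
)
```

A short TTL on the record limits how long clients keep resolving to the old region, but clients that cache DNS beyond the TTL would still need to reconnect.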
Upvotes: 2
Views: 1012
Reputation: 562681
Whole region outages are very rare (see https://awsmaniac.com/aws-outages/). I would be cautious about how much effort you invest in trying to automate detection and failover for such cases. It's a lot of work to do this, if it's possible at all. It's extremely hard to do this right, it's hard to test and hard to keep working. Lots of potential for false-positive failover events, or out of control flip-flopping. Whole companies have started up and failed trying to create fully automated failover solutions. I would bet that even the FAANG companies don't achieve it, but rely on site reliability engineers to respond to outages.
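To make the false-positive and flip-flopping risks concrete, here is a minimal sketch of the kind of guard an automated failover would need at the very least. It is not a complete solution, and the class name `FailoverGuard`, the thresholds, and the timestamps are all made up for illustration: it requires several consecutive failed health checks before acting, and refuses to fail over again during a cooldown window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverGuard:
    failure_threshold: int = 5        # consecutive failed checks required
    cooldown_seconds: float = 1800.0  # minimum time between failovers
    consecutive_failures: int = 0
    last_failover_at: Optional[float] = None

    def record_check(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True if a failover should fire now."""
        if healthy:
            # A single success resets the streak, filtering transient blips.
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False

        in_cooldown = (
            self.last_failover_at is not None
            and now - self.last_failover_at < self.cooldown_seconds
        )
        if in_cooldown:
            # We failed over recently; do not flip again yet.
            return False

        self.last_failover_at = now
        self.consecutive_failures = 0
        return True


guard = FailoverGuard(failure_threshold=3, cooldown_seconds=100)
decisions = [guard.record_check(healthy, now) for now, healthy in enumerate(
    [False, False, True, False, False, False, False, False, False]
)]
```

Even this small guard leaves the hard parts open: deciding what counts as "healthy" from outside the failing region, and what to do when the health checker itself is the thing that is broken.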
IMO, it's more cost-effective to develop a nicely written runbook for manual cutover to the other region, and then make sure your staff practice region failover periodically. This ensures the docs are kept up to date, the tools work, and the team is familiar with the steps.
DNS updates are slow. What I would recommend instead is some sort of proxy server, so your apps can use a single endpoint, and the proxy can switch which database on the back-end to use dynamically. This is basically what MySQL Router is for, and I've also done a proof of concept with Envoy Proxy (sorry I don't have access to that code anymore), and I suppose you could do the same thing with ProxySQL.
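As a toy illustration of the proxy idea (this is not how MySQL Router, Envoy, or ProxySQL are implemented or configured), here is a minimal layer-4 forwarder in Python using asyncio. The class name `SwitchableProxy` and the hostnames are hypothetical. Each new client connection is sent to whichever backend is active at that moment, so switching takes effect without any DNS change.

```python
import asyncio


class SwitchableProxy:
    def __init__(self, backend):
        self.backend = backend  # (host, port) of the active database

    def switch_to(self, backend):
        """Send all NEW connections to a different backend."""
        self.backend = backend

    async def _pipe(self, reader, writer):
        # Copy bytes one way until the source closes, then close the sink.
        try:
            while data := await reader.read(65536):
                writer.write(data)
                await writer.drain()
        except ConnectionError:
            pass
        finally:
            writer.close()

    async def handle(self, client_reader, client_writer):
        host, port = self.backend  # read at connect time
        try:
            up_reader, up_writer = await asyncio.open_connection(host, port)
        except OSError:
            client_writer.close()  # backend unreachable: drop the client
            return
        await asyncio.gather(
            self._pipe(client_reader, up_writer),
            self._pipe(up_reader, client_writer),
        )

    async def serve(self, listen_host="127.0.0.1", listen_port=3306):
        server = await asyncio.start_server(self.handle, listen_host, listen_port)
        async with server:
            await server.serve_forever()


# Hypothetical hostnames; apps would connect to the proxy, not to these.
proxy = SwitchableProxy(("primary-us-east-1.example.internal", 3306))
proxy.switch_to(("backup-us-east-2.example.internal", 3306))
# To run it: asyncio.run(proxy.serve())
```

Connections that are already open stay on the old backend until they close, so a real cutover would also need to drain or kill existing sessions.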
My opinion is that AWS still has room for improvement with respect to failover for RDS and Aurora. It works, but it can cause long downtimes on the order of several minutes, so it's hardly an improvement over manual failover. By manual failover I mean that some on-call engineer gets paged, checks some dashboards to confirm that it's a legitimate outage, and then executes the runbook.
Upvotes: 0