Reputation: 5660
I am trying to use Terraform to scale an RDS cluster for Aurora.
I am setting up an RDS instance with 3 servers - 1 writer and 2 read replicas. Here are my requirements:
When any of the servers fails, add a new server so that the cluster always has a minimum of 3 servers.
When the CPU usage of any host exceeds 50%, add a new server to the cluster. The max number of servers is 4.
Is it possible to create a policy such that when any of the 3 servers fails, a new server is created for that RDS instance? If yes, how do I monitor server failure?
Do I need to use appAutoScaling, autoScaling, or both? This is the link that matches my use case: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/appautoscaling_policy
Upvotes: 1
Views: 929
Reputation: 238957
I developed an example terraform config file for your question. It is ready to use, but should be treated as an example only, for learning and testing purposes. It was tested in the us-east-1 region using a default VPC, terraform 0.13 and AWS provider 3.6.
The key resources created by the example terraform config file are:

- an Aurora MySQL cluster (aws_rds_cluster with three aws_rds_cluster_instance members),
- an application auto scaling target and policy for the read replicas,
- RDS event subscriptions publishing to an SNS topic, with an SQS queue subscribed to it.

Below I expand on the questions asked and on the example config file.

The cluster will be provisioned with 1 writer and 2 read replicas.

The application auto scaling policy is of type TargetTrackingScaling and uses the predefined RDSReaderAverageCPUUtilization metric. The scaling policy is based on the overall CPU utilization of the replicas (50% target), not on individual replicas.
This is a good practice, as Aurora replicas are load balanced automatically at the connection level. This means that new connections will be spread roughly equally across the available replicas, on condition that you are using the reader endpoint.
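As a quick illustration (using the database name and master user from the example config below), you would connect through the reader endpoint rather than an individual instance endpoint:
mysql -h <reader-endpoint> -u root -p myauroradb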
Also, any alarm or scaling policy which you apply to individual replicas will become void once the replicas get replaced by scaling in/out activities or failures. This is because such a policy would be bound to a specific db instance; once the instance is gone, the alarm will not work.
The alarms associated with the policy that AWS creates on your behalf can be viewed in the CloudWatch Alarms Console.
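They can also be listed with the AWS CLI; the auto-created alarm names should start with a TargetTracking prefix (an assumption based on how Application Auto Scaling usually names these alarms):
aws cloudwatch describe-alarms --alarm-name-prefix TargetTracking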
If any db instance fails, Aurora will automatically proceed with fixing the problem, which can include restarting the db instance, promoting a read replica to be the new master, restarting MySQL, or fully replacing a failed instance.
You can simulate these events yourself to some extent, as described in Testing Amazon Aurora Using Fault Injection Queries.
Test failover to read replica
aws rds failover-db-cluster --db-cluster-identifier aurora-cluster-demo
Test crash of master instance
This will result in an automated restart of the instance:
mysql -h <endpoint> -u root -e "ALTER SYSTEM CRASH INSTANCE;"
Test crash of reader instance
This will result in restarting MySQL.
mysql -h <endpoint> -u root -e "ALTER SYSTEM SIMULATE 100 PERCENT READ REPLICA FAILURE TO ALL FOR INTERVAL 10 MINUTE;"
Test replacement of the reader
You can simulate total failure of the reader instance by manually deleting it in the console. Once deleted, Aurora will provision a replacement automatically.
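The same can be done from the CLI; assuming the instance naming from the config below and that aurora-cluster-demo-2 is currently acting as a reader, it would be something like:
aws rds delete-db-instance --db-instance-identifier aurora-cluster-demo-2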
You can use Amazon RDS Event Notification to automatically detect and respond to a variety of events associated with your Aurora cluster and its instances. Failures are one of the event types captured by the RDS Event Notification mechanism.
You can subscribe to the categories of events of interest and receive notifications through SNS. Once the events are detected and published to SNS, you can do whatever you want with them. For example, invoke a lambda function to analyze the event and the current state of your Aurora cluster, execute corrective actions, or send email notifications.
For example, when you manually force a failover as shown earlier, you will get a message with the following info (only a fragment is shown):
\"Event Message\":\"Started cross AZ failover to DB instance: aurora-cluster-demo-1\"
and later:
\"Event Message\":\"Completed failover to DB instance: aurora-cluster-demo-1\"}"
The example terraform config file subscribes to a number of categories, so you would have to fine-tune them to exactly what you require. You could also subscribe to all of them and have a lambda function analyze the events as they happen, deciding whether they should only be archived or whether the function should execute some automated procedure.
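Such wiring is not part of the example below, but a minimal sketch could look like this; the aws_lambda_function resource named event_analyzer is hypothetical and would have to be defined separately:
# Subscribe a (hypothetical) Lambda function to the SNS topic from the config below
resource "aws_sns_topic_subscription" "lambda_target" {
  topic_arn = aws_sns_topic.default.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.event_analyzer.arn
}

# Allow SNS to invoke the function
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.event_analyzer.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.default.arn
}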
Aurora read replicas are scaled using application auto scaling, not AutoScaling (I assume here that you mean EC2 AutoScaling). EC2 AutoScaling is used only for regular EC2 instances, not for RDS.
provider "aws" {
# YOUR DATA
region = "us-east-1"
}
data "aws_vpc" "default" {
default = true
}
resource "aws_rds_cluster" "default" {
cluster_identifier = "aurora-cluster-demo"
engine = "aurora-mysql"
engine_version = "5.7.mysql_aurora.2.03.2"
database_name = "myauroradb"
master_username = "root"
master_password = "bar4343sfdf233"
vpc_security_group_ids = [aws_security_group.allow_mysql.id]
backup_retention_period = 1
skip_final_snapshot = true
}
resource "aws_rds_cluster_instance" "cluster_instances" {
count = 3
identifier = "aurora-cluster-demo-${count.index}"
cluster_identifier = aws_rds_cluster.default.id
instance_class = "db.t2.small"
publicly_accessible = true
engine = aws_rds_cluster.default.engine
engine_version = aws_rds_cluster.default.engine_version
}
resource "aws_security_group" "allow_mysql" {
name = "allow_mysql"
description = "Allow Mysql inbound Internet traffic"
vpc_id = data.aws_vpc.default.id
ingress {
description = "Mysql poert"
from_port = 3306
to_port = 3306
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_appautoscaling_target" "replicas" {
service_namespace = "rds"
scalable_dimension = "rds:cluster:ReadReplicaCount"
resource_id = "cluster:${aws_rds_cluster.default.id}"
min_capacity = 2
max_capacity = 4
}
resource "aws_appautoscaling_policy" "replicas" {
name = "cpu-auto-scaling"
service_namespace = aws_appautoscaling_target.replicas.service_namespace
scalable_dimension = aws_appautoscaling_target.replicas.scalable_dimension
resource_id = aws_appautoscaling_target.replicas.resource_id
policy_type = "TargetTrackingScaling"
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "RDSReaderAverageCPUUtilization"
}
target_value = 50
scale_in_cooldown = 300
scale_out_cooldown = 300
}
}
resource "aws_sns_topic" "default" {
name = "rds-events"
}
resource "aws_sqs_queue" "default" {
name = "aurora-notifications"
}
resource "aws_sns_topic_subscription" "user_updates_sqs_target" {
topic_arn = aws_sns_topic.default.arn
protocol = "sqs"
endpoint = aws_sqs_queue.default.arn
}
resource "aws_sqs_queue_policy" "test" {
queue_url = aws_sqs_queue.default.id
policy = <<POLICY
{
"Version": "2012-10-17",
"Id": "sqspolicy",
"Statement": [
{
"Sid": "First",
"Effect": "Allow",
"Principal": "*",
"Action": "sqs:SendMessage",
"Resource": "${aws_sqs_queue.default.arn}",
"Condition": {
"ArnEquals": {
"aws:SourceArn": "${aws_sns_topic.default.arn}"
}
}
}
]
}
POLICY
}
resource "aws_db_event_subscription" "cluster" {
name = "cluster-events"
sns_topic = aws_sns_topic.default.arn
source_type = "db-cluster"
event_categories = [
"failover", "failure", "deletion", "notification"
]
}
resource "aws_db_event_subscription" "instances" {
name = "instances-events"
sns_topic = aws_sns_topic.default.arn
source_type = "db-instance"
event_categories = [
"availability",
"deletion",
"failover",
"failure",
"low storage",
"maintenance",
"notification",
"read replica",
"recovery",
"restoration",
]
}
output "endpoint" {
value = aws_rds_cluster.default.endpoint
}
output "reader-endpoint" {
value = aws_rds_cluster.default.reader_endpoint
}
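To try the example (assuming your AWS credentials for us-east-1 are already configured), the usual Terraform workflow applies; destroy the resources when you are done testing to avoid charges:
terraform init
terraform apply
terraform destroy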
Upvotes: 8