alan

Reputation: 6953

Terraform RDS: Restore data after destroy due to modification?

I am trying to create an RDS Aurora MySQL cluster in AWS using Terraform. However, I've noticed that any time I alter the cluster in a way that requires it to be replaced, all data is lost. I have configured the cluster to take a final snapshot and would like to restore from that snapshot, or restore the original data through some alternative measure.

Example: Change Cluster -> TF Destroys the original cluster -> TF Replaces with new cluster -> Restore Data from original

I have attempted to use the same snapshot identifier for both aws_rds_cluster.snapshot_identifier and aws_rds_cluster.final_snapshot_identifier, but Terraform fails because the final snapshot of the destroyed cluster doesn't exist yet.
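
Roughly, that attempt looked like this (the snapshot name is a placeholder):

resource "aws_rds_cluster" "provisioned_cluster" {
  # ... other arguments omitted ...
  # Fails on the first deployment because no snapshot with this name exists yet.
  skip_final_snapshot       = false
  final_snapshot_identifier = "example-cluster-final-snapshot"
  snapshot_identifier       = "example-cluster-final-snapshot"
}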

I've also attempted to use the rds-finalsnapshot module, but it turns out it is primarily intended for spinning environments up and down while preserving the data, i.e. destroying an entire cluster and then recreating it as part of a separate deployment. (Module: https://registry.terraform.io/modules/connect-group/rds-finalsnapshot/aws/latest)

module "snapshot_maintenance" {
  source="connect-group/rds-finalsnapshot/aws//modules/rds_snapshot_maintenance"    
  identifier                    = local.cluster_identifier
  is_cluster                    = true
  database_endpoint             = element(aws_rds_cluster_instance.cluster_instance.*.endpoint, 0)
  number_of_snapshots_to_retain = 3
}

resource "aws_rds_cluster" "provisioned_cluster" {
  cluster_identifier                  = module.snapshot_maintenance.identifier
  engine                              = "aurora-mysql"
  engine_version                      = "5.7.mysql_aurora.2.10.0"
  port                                = 1234
  database_name                       = "example" 
  master_username                     = "example"
  master_password                     = "example"
  iam_database_authentication_enabled = true 
  storage_encrypted                   = true
  backup_retention_period             = 2
  db_subnet_group_name                = "example"
  skip_final_snapshot                 = false
  final_snapshot_identifier           = module.snapshot_maintenance.final_snapshot_identifier
  snapshot_identifier                 = module.snapshot_maintenance.snapshot_to_restore 
  vpc_security_group_ids              = ["example"]
    
}

What I find is that if a change requires the cluster to be destroyed and recreated, I don't have a great way to restore the data as part of the same deployment.

I'll add that I don't think this is an issue with my code. It's more of a lifecycle limitation of TF. I believe I can't be the only person who wants to preserve the data in their cluster in the event TF determines the cluster must be recreated.

If I wanted to prevent loss of data due to a change to the cluster that results in a destroy, do I need to destroy the cluster outside of Terraform (or through the CLI), sync up Terraform's state, and then apply?

Upvotes: 2

Views: 3338

Answers (1)

alan

Reputation: 6953

The solution ended up being rather simple, albeit obscure. I tried over 50 different approaches using combinations of existing resource properties, provisioners, null resources (with triggers), and external data blocks with AWS CLI commands and PowerShell scripts.

The challenge was that the provisioning had to happen in this order to avoid data loss:

  1. Stop the DMS replication tasks from replicating more data into the database.
  2. Take a new snapshot of the cluster once incoming data has stopped.
  3. Destroy and recreate the cluster, using snapshot_identifier to specify the snapshot taken in the previous step.
  4. Destroy and recreate the DMS tasks.

Of course, these steps depended on how Terraform decided it needed to apply the updates. It might determine that only an in-place update was needed; that wasn't my concern. I needed to handle the scenarios where the resources were destroyed.

The final solution was to move away from external data blocks wherever possible and rely on local-exec provisioners, because external data blocks execute even when only running terraform plan. The local-exec provisioners let me tap into lifecycle events like "create" and "destroy" to ensure my PowerShell scripts would only execute during terraform apply.

On my cluster, I set both final_snapshot_identifier and snapshot_identifier to the same value.

final_snapshot_identifier = local.snapshot_identifier
snapshot_identifier       = data.external.check_for_first_run.result.isFirstRun == "true" ? null : local.snapshot_identifier

snapshot_identifier is only set after the first deployment. An external data block lets me check whether the resource already exists in order to drive that condition. The condition is necessary because on a first deployment the snapshot won't exist, and Terraform would fail during the "planning" step because of it.
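
The data block itself is roughly this shape (the script path is illustrative; the program just has to print a single JSON object on stdout, e.g. {"isFirstRun": "true"}):

data "external" "check_for_first_run" {
  # Calls out to a script that checks (via the AWS CLI) whether the cluster already exists
  # and reports the result as JSON on stdout.
  program = ["PowerShell", "/powershell_scripts/check_for_first_run.ps1"]
}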

Then I execute a PowerShell script in a local-exec provisioner on "destroy" to stop any DMS tasks and then delete the snapshot named local.snapshot_identifier.

  provisioner "local-exec" {
    when    = destroy
     # First, stop the inflow of data to the cluster by stopping the dms tasks.  
     # Next, we've tricked TF into thinking the snapshot we want to use is there by using the same name for old and new snapshots, but before we destroy the cluster, we need to delete the original.
     # Then TF will create the final snapshot immediately following the execution of the below script and it will be used to restore the cluster since we've set it as snapshot_identifier.
    command = "/powershell_scripts/stop_dms_tasks.ps1; aws rds delete-db-cluster-snapshot --db-cluster-snapshot-identifier benefitsystem-cluster"
    interpreter = ["PowerShell"]
  }

This clears out the last snapshot and allows Terraform to create a new final snapshot by the same name as the original, just in time to be used to restore from.
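
The create side uses the same pattern. A hook along these lines (the script path is illustrative) restarts the DMS tasks once the new cluster is up:

  provisioner "local-exec" {
    # Create-time provisioners are the default, so no "when" argument is needed.
    # Restart the DMS replication tasks so data starts flowing into the restored cluster again.
    command     = "/powershell_scripts/start_dms_tasks.ps1"
    interpreter = ["PowerShell"]
  }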

Now, I can run Terraform the first time and get a brand-new cluster. All subsequent deployments will use the final snapshot to restore from and data is preserved.

Upvotes: 4
