Reputation: 517
We occasionally need to perform maintenance or upgrade procedures on our EC2 instances, particularly ones running database software. The first step in these procedures is typically "backup your volumes before proceeding!"
When we take a backup, we use AWS Backup to generate an AMI for the machine along with volume snapshots.
Our process thus far has been to wait for volume snapshots to enter the "Available (100%)" state before we move forward with the rest of a maintenance procedure. However, this can significantly increase the length of the maintenance window, since it can take an hour or more for snapshots to become fully available in some cases.
That said, the snapshot documentation seems to indicate that the snapshot is taken at the point in time the snapshot command is issued:
Snapshots occur asynchronously; the point-in-time snapshot is created immediately, but the status of the snapshot is pending until the snapshot is complete (when all of the modified blocks have been transferred to Amazon S3), which can take several hours for large initial snapshots or subsequent snapshots where many blocks have changed. While it is completing, an in-progress snapshot is not affected by ongoing reads and writes to the volume.
To me, this implies that it may NOT be necessary to wait until the snapshot has fully transferred to S3 (and transitioned to the "Available" state) before taking additional "potentially dangerous" maintenance actions on the volume (software updates, migrations, etc. that might go wrong or corrupt the data).
When a procedure tells you to create volume snapshots to ensure you have recent backups before proceeding, is it sufficient to create a snapshot that is in the "Pending" state, or is it necessary to wait until the snapshot transitions to the "Available" state? At what point can you say that the data on the volume is safely backed up? Have there been cases where snapshots fail to reach the "Available" state due to some error or other corruption?
Upvotes: 1
Views: 908
Reputation: 425471
When a procedure tells you to create volume snapshots to ensure you have recent backups before proceeding, is it sufficient to create a snapshot that is in the "Pending" state, or is it necessary to wait until the snapshot transitions to the "Available" state?
This depends on the failure model: the scenarios in which things can plausibly go wrong, and how to mitigate them.
The failure model they seem to apply here is the EC2 instance becoming unusable after the upgrade: a software bug in the updated software, service incompatibility or something like that.
The mitigation procedure is restoring the EC2 instance to its pre-upgrade state.
To be able to restore the instance to the pre-upgrade state, you need to have a snapshot. The snapshot isn't there until it transitions to the Available state. It will (probably, but see below) eventually complete, but you won't be able to dump it to an EBS volume and attach it to your EC2 instance until it completes. During this time, your instance will be unusable.
If having to wait for the snapshot to complete before a possible rollback, and accepting the risk that it never completes, is acceptable under your failure model, then by all means go for it.
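If you do decide to wait, the wait can be scripted instead of watched in the console. Below is a minimal sketch, assuming a boto3 EC2 client is passed in as `ec2_client`; the function name, the polling interval and the timeout are my own choices, not anything AWS prescribes. Note that the EC2 API reports the finished state as `completed` (what the question calls "Available") and a failed snapshot as `error`.

```python
import time


def wait_for_snapshot(ec2_client, snapshot_id, poll_seconds=30.0,
                      timeout_seconds=7200.0, sleep=time.sleep):
    """Poll describe_snapshots until the snapshot completes, fails, or times out.

    ec2_client is assumed to be a boto3 EC2 client (boto3.client("ec2")).
    Returns "completed" on success, raises RuntimeError if the snapshot
    enters the "error" state, and TimeoutError if the deadline passes.
    """
    waited = 0.0
    while True:
        response = ec2_client.describe_snapshots(SnapshotIds=[snapshot_id])
        snapshot = response["Snapshots"][0]
        state = snapshot["State"]
        if state == "completed":
            return state
        if state == "error":
            message = snapshot.get("StateMessage", "no message")
            raise RuntimeError(f"snapshot {snapshot_id} failed: {message}")
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"snapshot {snapshot_id} still {state} after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

boto3 also ships a built-in waiter, `ec2_client.get_waiter("snapshot_completed")`, which does much the same thing; the explicit loop is shown only to make the `error` and timeout branches visible, since those are exactly the cases your rollback plan has to account for.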
Note that rollback from an S3 backup also takes time. AWS offers Fast Snapshot Restore (pre-warmed snapshots already sitting on inactive EBS volumes) for an additional price.
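For reference, Fast Snapshot Restore is enabled per snapshot and per Availability Zone. A rough sketch, again assuming a boto3 EC2 client is passed in; the helper name and the shape of its return value are made up for illustration, and the feature is billed separately for as long as it stays enabled:

```python
def enable_fast_restore(ec2_client, snapshot_ids, availability_zones):
    """Request Fast Snapshot Restore for the given snapshots and zones.

    ec2_client is assumed to be a boto3 EC2 client. Returns two lists:
    (snapshot_id, zone, state) tuples the API accepted, and the snapshot
    ids it rejected.
    """
    response = ec2_client.enable_fast_snapshot_restores(
        AvailabilityZones=list(availability_zones),
        SourceSnapshotIds=list(snapshot_ids),
    )
    accepted = [
        (item["SnapshotId"], item["AvailabilityZone"], item["State"])
        for item in response.get("Successful", [])
    ]
    rejected = [item["SnapshotId"] for item in response.get("Unsuccessful", [])]
    return accepted, rejected
```

An accepted request is not the same as a ready one: the snapshot still has to finish optimizing in each zone before restores from it are actually fast, so this does not shorten the wait for the snapshot itself.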
At what point can you say that the data on the volume is safely backed up?
Once you have been able to restore it.
Not when it says "Pending", not when it says "Available", not at any other time before you have used it to make a successful restore.
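A first step towards such a restore test can be automated. The sketch below assumes a boto3 EC2 client passed in as `ec2_client`; the function name, polling interval and timeout are hypothetical. It only checks that a volume can be created from the snapshot. To call the backup proven you would still attach the volume to a scratch instance, mount it, and run your database's own integrity checks.

```python
import time


def restore_test_volume(ec2_client, snapshot_id, availability_zone,
                        poll_seconds=15.0, timeout_seconds=3600.0,
                        sleep=time.sleep):
    """Create a volume from a snapshot and wait until it is usable.

    ec2_client is assumed to be a boto3 EC2 client. Returns the new volume
    id once its state is "available"; raises RuntimeError if volume
    creation fails and TimeoutError if the deadline passes.
    """
    volume = ec2_client.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]
    waited = 0.0
    while True:
        described = ec2_client.describe_volumes(VolumeIds=[volume_id])
        state = described["Volumes"][0]["State"]
        if state == "available":
            return volume_id
        if state == "error":
            raise RuntimeError(f"volume {volume_id} from {snapshot_id} failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"volume {volume_id} still {state}")
        sleep(poll_seconds)
        waited += poll_seconds
```

Remember to delete the test volume afterwards, and keep in mind that `create_volume` will be rejected while the snapshot is still pending, which is the point made above.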
Have there been cases where snapshots fail to reach the "Available" state due to some error or other corruption?
Yes, a lot.
EBS volumes can and do fail all the time. AWS has a special EBS state for exactly this event and explains how to deal with it.
S3 has much higher redundancy than EBS, but even S3 can fail spectacularly.
If either your EBS volume or S3 fails while the snapshot is in the pending state, the snapshot will never complete, and thus you will never be able to use it for a successful restore.
Upvotes: 2