Nizam Mohideen

Reputation: 893

DynamoDB - How to do incremental backup?

I am using DynamoDB tables with keys and throughput optimized for application use cases. To support other ad hoc administrative and reporting use cases, I want to keep a complete backup in S3 (a day-old backup is OK). However, I cannot afford to scan the entire tables to do the backup, and the keys I have are not sufficient to find out what is "new". How do I do incremental backups? Do I have to modify my DynamoDB schema, or add extra tables just to do this? Any best practices?

Update: DynamoDB Streams solves this problem.

DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near real time.
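
For reference, the stream can be read directly with the DynamoDB Streams API as well as through Lambda. A minimal boto3 polling sketch, assuming a stream is already enabled on the table and using an illustrative table name:

    import boto3

    streams = boto3.client("dynamodbstreams")
    dynamodb = boto3.client("dynamodb")

    # Look up the stream ARN for the table (assumes a stream is enabled).
    table = dynamodb.describe_table(TableName="MyTable")["Table"]
    stream_arn = table["LatestStreamArn"]

    # Walk each shard from the oldest available record (up to 24 hours back).
    description = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]
    for shard in description["Shards"]:
        iterator = streams.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        while iterator:
            page = streams.get_records(ShardIterator=iterator)
            for record in page["Records"]:
                print(record["eventName"], record["dynamodb"].get("Keys"))
            iterator = page.get("NextShardIterator")
            if not page["Records"]:
                break  # stop polling this shard once it is drained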

Upvotes: 8

Views: 8545

Answers (5)

Abhaya Chauhan

Reputation: 1201

For incremental backups, you can associate your DynamoDB Stream with a Lambda function that is triggered automatically on every data update (e.g., writing the changed data to another store such as S3).

Here is a Lambda function you can use with DynamoDB Streams for incremental backups:

https://github.com/PageUpPeopleOrg/dynamodb-replicator

I've provided a detailed walkthrough on how you can use DynamoDB Streams, Lambda and S3 versioned buckets to create incremental backups for your data in DynamoDB on my blog:

https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
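
Roughly, the Lambda handler behind this pattern just mirrors each change into the versioned bucket. A minimal Python sketch, where the bucket name and key layout are illustrative and the stream view type is assumed to include new images:

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-dynamodb-backup-bucket"  # assumed versioned bucket

    def handler(event, context):
        # Each record describes one item-level change from the DynamoDB stream.
        for record in event["Records"]:
            keys = record["dynamodb"]["Keys"]  # DynamoDB-typed key attributes
            object_key = "backup/" + json.dumps(keys, sort_keys=True)
            if record["eventName"] in ("INSERT", "MODIFY"):
                # Store the full new image; older versions are retained by S3 versioning.
                s3.put_object(
                    Bucket=BUCKET,
                    Key=object_key,
                    Body=json.dumps(record["dynamodb"]["NewImage"]),
                )
            elif record["eventName"] == "REMOVE":
                # A delete marker preserves history in a versioned bucket.
                s3.delete_object(Bucket=BUCKET, Key=object_key)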

Alternatively, DynamoDB just released On-Demand backups and restores. They aren't incremental, but they are full backup snapshots.

Check out https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/ for more information.

HTH

Upvotes: 4

mkobit

Reputation: 47239

On November 29th, 2017, On-Demand Backup was introduced. It allows you to create backups directly in DynamoDB, essentially instantly, without consuming any capacity. Here are a few snippets from the blog post:

This feature is designed to help you to comply with regulatory requirements for long-term archival and data retention. You can create a backup with a click (or an API call) without consuming your provisioned throughput capacity or impacting the responsiveness of your application. Backups are stored in a highly durable fashion and can be used to create fresh tables.

...

The backup is available right away! It is encrypted with an Amazon-managed key and includes all of the table data, provisioned capacity settings, Local and Global Secondary Index settings, and Streams. It does not include Auto Scaling or TTL settings, tags, IAM policies, CloudWatch metrics, or CloudWatch Alarms.

You may be wondering how this operation can be instant, given that some of our customers have tables approaching half of a petabyte. Behind the scenes, DynamoDB takes full snapshots and saves all change logs. Taking a backup is as simple as saving a timestamp along with the current metadata for the table.
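
If that fits your needs, taking one of these backups is a single API call. A minimal boto3 sketch, with placeholder table and backup names:

    import datetime
    import boto3

    dynamodb = boto3.client("dynamodb")

    # Create a full, point-in-time backup without consuming table capacity.
    response = dynamodb.create_backup(
        TableName="MyTable",  # assumed table name
        BackupName="MyTable-" + datetime.date.today().isoformat(),
    )
    backup_arn = response["BackupDetails"]["BackupArn"]

    # Later, restore the backup into a new table.
    dynamodb.restore_table_from_backup(
        TargetTableName="MyTable-restored",
        BackupArn=backup_arn,
    )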

Upvotes: 3

Sergey Grigoriev

Reputation: 719

A Scan operation in DynamoDB returns rows ordered by primary key (hash key). So if a table's hash key is an auto-incremented integer, save the key of the last record backed up and pass it as the ExclusiveStartKey parameter of the Scan request for the next backup (the response reports this as LastEvaluatedKey); the scan will then return only records created since the last backup.
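
A minimal boto3 sketch of that idea, with an illustrative table name; it assumes new items really do appear after the saved key in scan order:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("MyTable")  # assumed table name

    def incremental_scan(last_backed_up_key=None):
        """Yield items that appear after last_backed_up_key in scan order."""
        kwargs = {}
        if last_backed_up_key is not None:
            # Resume the scan just after the last key saved by the previous backup.
            kwargs["ExclusiveStartKey"] = last_backed_up_key
        while True:
            page = table.scan(**kwargs)
            for item in page["Items"]:
                yield item
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]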

Upvotes: 0

Shitanshu Verma

Reputation: 41

You can now use DynamoDB Streams to persist data into another table or to maintain another copy of the data in another datastore.

https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/

Upvotes: 4

Steven Hood

Reputation: 688

I see two options:

  1. Generate the current snapshot. You'll have to read from the table to do this, which you can do at a very slow rate to stay under your capacity limits (Scan operation). Then, keep an in-memory list of updates performed for some period of time. You could put these in another table, but you'll have to read those, too, which would probably cost just as much. This time interval could be a minute, 10 minutes, an hour, whatever you're comfortable losing if your application exits. Then, periodically grab your snapshot from S3, replay these changes on the snapshot, and upload your new snapshot. I don't know how large your data set is, so this may not be practical, but I've seen this done with great success for data sets up to 1-2GB.

  2. Add read throughput and back up your data using a full scan every day. You say you can't afford it, but it isn't clear whether you mean paying for the capacity, or that the scan would use up all the capacity and the application would begin failing. The only way to pull data out of DynamoDB is to read it, either strongly or eventually consistently. If the backup is part of your business requirements, then I think you have to determine if it's worth it. You can self-throttle your read by examining the ConsumedCapacityUnits property on your results. The Scan operation has a Limit property that you can use to limit the amount of data read in each operation. Scan also uses eventually consistent reads, which are half the price of strongly consistent reads. A self-throttled scan along these lines is sketched below.
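
A rough Python (boto3) sketch of that self-throttled daily scan, with illustrative table/bucket names and a simple sleep-based throttle; note the current SDK reports usage via ConsumedCapacity rather than ConsumedCapacityUnits:

    import json
    import time
    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    TABLE_NAME = "MyTable"                 # assumed names
    BUCKET = "my-dynamodb-backup-bucket"
    MAX_UNITS_PER_SECOND = 10              # stay well under provisioned read capacity

    def daily_backup():
        table = dynamodb.Table(TABLE_NAME)
        kwargs = {"Limit": 100, "ReturnConsumedCapacity": "TOTAL"}
        page_number = 0
        while True:
            page = table.scan(**kwargs)    # eventually consistent by default
            s3.put_object(
                Bucket=BUCKET,
                Key="full-backup/page-{:06d}.json".format(page_number),
                Body=json.dumps(page["Items"], default=str),
            )
            page_number += 1
            # Self-throttle: sleep long enough that consumed read capacity
            # averages out below MAX_UNITS_PER_SECOND.
            used = page["ConsumedCapacity"]["CapacityUnits"]
            time.sleep(used / MAX_UNITS_PER_SECOND)
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]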

Upvotes: 6
