Reputation: 67968
I have a 15 GB table in DynamoDB. I need to transfer some of the data, filtered by a timestamp attribute stored in the table, to another DynamoDB table. What would be the most efficient option here?
a) Export to S3, process the data with pandas or some other tool, and put it in the other table (the data is huge; I feel this might take a very long time)
b) Through AWS Data Pipeline (I have read a lot about it, but I don't think it lets me run queries to filter the data)
c) Through EMR and Hive (this seems like the best option, but is it possible to do everything through a Python script? Would I need to keep an EMR cluster running, or create and terminate one on every run? How can EMR be used efficiently and cheaply?)
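For option (c), this is roughly what I had in mind: a transient cluster launched from a Python script with boto3 that terminates itself once the step finishes. The region, release label, instance types, IAM roles and the S3 path of the Hive script below are placeholders, and the Hive script itself (filtering by timestamp and writing to the target table) is assumed to already exist at that location.

```python
# Sketch: launch a transient EMR cluster that runs one Hive step and then
# terminates. All names, paths, and sizes are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.run_job_flow(
    Name="dynamodb-timestamp-copy",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Auto-terminate once all steps finish, so the cluster only costs money while the job runs
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run-hive-script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://my-bucket/scripts/copy.q"],  # placeholder script path
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```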
Upvotes: 1
Views: 994
Reputation: 3709
I would suggest going with the Data Pipeline into S3 approach, and then having a script read the export from S3 and process your records. You can schedule this to run at regular intervals to back up all your data. I don't think any solution that does a full scan will be faster, because it is always limited by the table's read throughput.
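A rough sketch of the S3-to-DynamoDB copy step, assuming the export lands in S3 as JSON Lines (one plain JSON object per line; Data Pipeline's actual export format may differ). The bucket, prefix, table name, timestamp attribute and cutoff are placeholders:

```python
# Sketch: read exported items from S3, keep those newer than a cutoff,
# and batch-write them into the second table.
import json
import boto3

s3 = boto3.client("s3")
target = boto3.resource("dynamodb").Table("target-table")  # placeholder table name

BUCKET = "my-export-bucket"              # placeholder bucket
PREFIX = "dynamodb-exports/latest/"      # placeholder prefix
CUTOFF = 1672531200                      # example epoch-seconds cutoff

def copy_recent_items():
    paginator = s3.get_paginator("list_objects_v2")
    with target.batch_writer() as batch:  # batches PutItem calls under the hood
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
                for line in body.iter_lines():
                    if not line:
                        continue
                    item = json.loads(line)
                    # Only copy items at or after the cutoff timestamp
                    # (assumes attribute values are types the resource API accepts)
                    if int(item.get("timestamp", 0)) >= CUTOFF:
                        batch.put_item(Item=item)

if __name__ == "__main__":
    copy_recent_items()
```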
Another possible approach is to use DynamoDB Streams and Lambda to maintain the second table in real time. You will still need to process the existing 15 GB once using the approach above, and then switch to Lambda to keep the tables in sync.
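A minimal Lambda handler sketch for the streams approach, assuming the function is subscribed to the source table's stream with a view type that includes new images, and that "target-table" is a placeholder:

```python
# Sketch: replicate stream events from the source table into the target table.
import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
target = boto3.resource("dynamodb").Table("target-table")  # placeholder table name

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # Stream images arrive in DynamoDB's typed JSON; convert to plain Python values
            image = record["dynamodb"]["NewImage"]
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            target.put_item(Item=item)
        elif record["eventName"] == "REMOVE":
            keys = record["dynamodb"]["Keys"]
            key = {k: deserializer.deserialize(v) for k, v in keys.items()}
            target.delete_item(Key=key)
```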
Upvotes: 1