Reputation: 301
I have a big table in Redshift and I need to automate the process of archiving its monthly data.
The current approach is as follows (manual):
I need to automate this approach.
Is using AWS Data Pipeline a good approach?
Please suggest any other effective approaches; examples are appreciated.
Thanks for the help!
Upvotes: 0
Views: 3421
Reputation: 513
I do not know if this is a situation Amazon considers "common" enough to suggest best practices for, but here are several options. (Note: the popular consensus seems to be that Data Pipeline is good for simple data loading, but it isn't intended as an archiving method.)
1. Create a snapshot each month and then remove the archived data from the table. The snapshot API actions (or console scheduling) can be automated, and each snapshot carries a date stamp; a rough sketch follows right after this list.
2. Copy the data out to an external table in S3 and then delete it from the Redshift table. I think you can use Redshift Spectrum for this; a second sketch appears at the end of this answer.
3. Use a third-party backup solution such as N2WS or Panoply.
4. Use AWS Glue (or perhaps Data Pipeline, which I haven't used) to pull out the table structure and data, then truncate the original table.
5. Use an external ETL product to do the same as Glue; some are free, or your firm might already use one.
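For option 1, here is a minimal boto3 sketch, not an AWS-prescribed recipe; the cluster name, region, and the idea of deleting the rows only after the snapshot completes are my own placeholders and assumptions:

```python
# Rough sketch of option 1: take a date-stamped manual snapshot, wait for it,
# and only then delete last month's rows. "my-cluster" and the region are
# placeholders; boto3 credentials are assumed to be configured already.
import datetime

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

snapshot_id = "archive-{:%Y-%m}".format(datetime.date.today())

# Manual snapshots are retained until you delete them, unlike automated ones.
redshift.create_cluster_snapshot(
    SnapshotIdentifier=snapshot_id,
    ClusterIdentifier="my-cluster",
)

# Block until the snapshot is available before removing any data.
waiter = redshift.get_waiter("snapshot_available")
waiter.wait(ClusterIdentifier="my-cluster", SnapshotIdentifier=snapshot_id)

print("Snapshot {} is ready; safe to DELETE the archived rows now.".format(snapshot_id))
```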
I didn't come across any best practices around this type of data copy.
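As promised under option 2, a rough sketch of the UNLOAD-to-S3 plus Spectrum route. This assumes psycopg2 as the driver, an existing external table `spectrum.big_table_archive` partitioned by month, and placeholder names for the cluster, bucket, IAM role, and `event_ts` column:

```python
# Sketch of option 2: UNLOAD a month of data to S3 as gzipped files, expose it
# through a Spectrum partition, then delete it from the local table.
# All connection details and object names below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="admin", password="...",
)
conn.autocommit = True  # Spectrum DDL cannot run inside a transaction block.
cur = conn.cursor()

month = "2018-01"

# Export the month's rows as gzipped files in S3.
cur.execute(f"""
    UNLOAD ('SELECT * FROM public.big_table
             WHERE date_trunc(''month'', event_ts) = ''{month}-01''')
    TO 's3://my-archive-bucket/big_table/{month}/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    GZIP DELIMITER '|' ALLOWOVERWRITE;
""")

# Register the files as a partition of the existing Spectrum external table
# (spectrum.big_table_archive must already exist and be partitioned by month).
cur.execute(f"""
    ALTER TABLE spectrum.big_table_archive
    ADD IF NOT EXISTS PARTITION (month = '{month}')
    LOCATION 's3://my-archive-bucket/big_table/{month}/';
""")

# Finally remove the archived rows from the hot table.
cur.execute(
    f"DELETE FROM public.big_table "
    f"WHERE date_trunc('month', event_ts) = '{month}-01';"
)
```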
Upvotes: 1
Reputation: 4354
My suggested approach is to set up Airflow on a small instance to run the scheduling, or, if that is too much work, to set up a crontab.
Have the scheduled job UNLOAD each month's data to S3; I suggest using gzip format and limiting the size to about 20-100 MB per file.
That way the data is outside Redshift but can still be accessed from Redshift (for example through Spectrum) whenever you need it. A sketch follows below.
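For example, a minimal Airflow 2 DAG along these lines; it assumes the Postgres provider is installed, a `redshift_default` connection points at the cluster, and the table, bucket, IAM role, and `event_ts` column names are placeholders:

```python
# Monthly archive DAG sketch: UNLOAD last month's rows to S3 as gzipped
# ~100 MB files, then delete them from the Redshift table once that succeeds.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="redshift_monthly_archive",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    # Export the previous calendar month's rows to S3.
    unload = PostgresOperator(
        task_id="unload_to_s3",
        postgres_conn_id="redshift_default",
        sql="""
            UNLOAD ('SELECT * FROM public.big_table
                     WHERE event_ts <  date_trunc(''month'', current_date)
                       AND event_ts >= date_trunc(''month'', current_date) - interval ''1 month''')
            TO 's3://my-archive-bucket/big_table/{{ ds_nodash }}/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            GZIP MAXFILESIZE AS 100 MB ALLOWOVERWRITE;
        """,
    )

    # Remove the archived rows only after the unload task has succeeded.
    cleanup = PostgresOperator(
        task_id="delete_archived_rows",
        postgres_conn_id="redshift_default",
        sql="""
            DELETE FROM public.big_table
            WHERE event_ts < date_trunc('month', current_date);
        """,
    )

    unload >> cleanup
```

A plain crontab entry invoking a script with the same two statements would work just as well if you want to avoid running Airflow.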
Upvotes: 2