darekarsam

Reputation: 301

Best approach to automate archiving aws-redshift table

I have a big table in Redshift and I need to automate the process of archiving a month's data at a time.

The current (manual) approach is as follows; a sketch of these steps as SQL appears after the list:

  1. Unload the Redshift query result to S3
  2. Create a new backup table
  3. Copy the files from S3 into the backup table
  4. Remove the archived data from the original table
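
For context, this is roughly what the manual steps look like when run as SQL from a script (the cluster endpoint, bucket, IAM role, table names, and dates below are placeholders):

```python
# Rough sketch of the current manual steps, run through psycopg2.
# Endpoint, credentials, bucket, IAM role, table names and dates are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="admin", password="...")
conn.autocommit = True
cur = conn.cursor()

# 1. Unload the query result to S3, gzip-compressed
cur.execute("""
    UNLOAD ('SELECT * FROM sales WHERE sale_date < ''2018-01-01''')
    TO 's3://my-archive-bucket/sales/2017-12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP ALLOWOVERWRITE;
""")

# 2. Create a backup table with the same structure
cur.execute("CREATE TABLE sales_archive_2017_12 (LIKE sales);")

# 3. Copy the files from S3 into the backup table
cur.execute("""
    COPY sales_archive_2017_12
    FROM 's3://my-archive-bucket/sales/2017-12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP;
""")

# 4. Remove the archived rows from the original table
cur.execute("DELETE FROM sales WHERE sale_date < '2018-01-01';")
```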

I need to automate this process. Is AWS Data Pipeline a good fit? Please suggest any other effective approaches; examples are appreciated.

Thanks for the help!

Upvotes: 0

Views: 3421

Answers (2)

Aby Sheffer

Reputation: 513

I do not know if Amazon considers this situation "common" enough to publish recommended practices, but here are several options. (Note: the popular consensus seems to be that Data Pipeline is fine for simple data loading, but it isn't intended as an archiving method.)

  1. Create a cluster snapshot each month and then remove the archived data from the table. Snapshots can be automated via the snapshot API actions or the console's snapshot scheduling, and each snapshot carries a date stamp (see the boto3 sketch after this list).

  2. Copy the data out to an external table in S3 and then delete it from the Redshift table. I think Redshift Spectrum can be used for this.

  3. Use a third-party backup solution such as N2WS or Panopoly.

  4. Use AWS Glue (or perhaps Data Pipeline; I haven't used it) to pull out the table structure and data, and then truncate the original table.

  5. Use an external ETL product to do the same as Glue. Some are free, or your firm might already use one.
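
For option 1, a minimal sketch of an automated monthly snapshot using boto3 (the cluster identifier and naming scheme are assumptions; this could run from a scheduled Lambda or cron job):

```python
# Hypothetical monthly snapshot script for option 1, using boto3.
# The cluster identifier and snapshot naming scheme are assumptions.
from datetime import date
import boto3

redshift = boto3.client("redshift")

def take_monthly_snapshot(cluster_id="my-redshift-cluster"):
    # The snapshot name carries a date stamp, e.g. my-redshift-cluster-archive-2018-01
    snapshot_id = f"{cluster_id}-archive-{date.today():%Y-%m}"
    redshift.create_cluster_snapshot(
        SnapshotIdentifier=snapshot_id,
        ClusterIdentifier=cluster_id,
    )
    return snapshot_id

if __name__ == "__main__":
    print(take_monthly_snapshot())
```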

I didn't come across any best practices around this type of data copy.

Upvotes: 1

Jon Scott

Reputation: 4354

My suggested approach is to set up Airflow on a small instance to run the scheduling, or, if that is too much work, set up a crontab. (A minimal Airflow sketch appears at the end of this answer.)

  1. Using the Redshift UNLOAD command, copy the data that you want to archive to S3, using a sub-folder for each archive (e.g. for monthly archives, use the year and month as the folder name).
  2. Delete the archived data from your Redshift table.
  3. Set up a Redshift Spectrum external table definition for that data in S3; you can set it up to include all of the sub-folders if you wish.

I suggest using gzip compression and limiting the size to about 20-100 MB per file.
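
Roughly what steps 1-3 look like as SQL run from a script (the bucket, IAM role, table definition, and Glue catalog database name are placeholders; the GZIP and MAXFILESIZE options cover the file-size suggestion above):

```python
# Hypothetical SQL for steps 1-3, run through psycopg2.
# Endpoint, bucket, IAM role, table names and dates are placeholders.
import psycopg2

ROLE = "arn:aws:iam::123456789012:role/MyRedshiftRole"

conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="admin", password="...")
conn.autocommit = True
cur = conn.cursor()

# 1. Unload one month to its own S3 sub-folder, gzipped, capped at ~100 MB per file
cur.execute(f"""
    UNLOAD ('SELECT * FROM sales WHERE sale_date BETWEEN ''2017-12-01'' AND ''2017-12-31''')
    TO 's3://my-archive-bucket/sales/2017/12/'
    IAM_ROLE '{ROLE}'
    GZIP MAXFILESIZE 100 MB ALLOWOVERWRITE;
""")

# 2. Delete the archived rows from the Redshift table
cur.execute("DELETE FROM sales WHERE sale_date BETWEEN '2017-12-01' AND '2017-12-31';")

# 3. One-time setup: a Spectrum external table over the whole archive prefix.
#    UNLOAD writes pipe-delimited text by default, which Spectrum can read gzipped.
cur.execute(f"""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS archive
    FROM DATA CATALOG DATABASE 'sales_archive_db'
    IAM_ROLE '{ROLE}'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")
cur.execute("""
    CREATE EXTERNAL TABLE archive.sales (
        sale_id   BIGINT,
        sale_date DATE,
        amount    DECIMAL(12,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 's3://my-archive-bucket/sales/';
""")
```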

That way the data is outside Redshift but can still be queried from Redshift whenever you need it.
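
If you go the Airflow route, a minimal DAG skeleton for this schedule might look like the following (assuming Airflow 2.x; the three task callables are placeholders for the unload, delete, and Spectrum steps above):

```python
# Hypothetical Airflow 2.x DAG skeleton; the three callables are placeholders
# for the unload / delete / external-table steps described above.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def unload_to_s3(**context):
    ...  # UNLOAD the month being archived to s3://bucket/year/month/

def delete_archived_rows(**context):
    ...  # DELETE the same rows from the Redshift table

def refresh_external_table(**context):
    ...  # make the new S3 folder visible to the Spectrum external table

with DAG(
    dag_id="redshift_monthly_archive",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    unload = PythonOperator(task_id="unload_to_s3", python_callable=unload_to_s3)
    delete = PythonOperator(task_id="delete_archived_rows", python_callable=delete_archived_rows)
    refresh = PythonOperator(task_id="refresh_external_table", python_callable=refresh_external_table)

    unload >> delete >> refresh
```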

Upvotes: 2
