omri_saadon

Reputation: 10631

Unload all tables from Redshift to S3 - CPU usage

The goal is to unload a few tables (for each customer) to S3 in Parquet format every few hours.

Each table is around 1 GB in CSV format; in Parquet it is around 120 MB.

The issue is that when running 2-3 UNLOAD commands in parallel, the CPU of the Redshift nodes in the cluster goes to 98-100%.

This is the command I use:

UNLOAD ('SELECT * FROM my_table')
TO 's3://test_bucket'
IAM_ROLE 'my_arn'
ALLOWOVERWRITE
FORMAT PARQUET;

I also tried to unload the data as CSV, but the nodes still went to around 98-100% CPU.

I would like to save a snapshot of each whole table (a few tables) every few hours, and I must have all the columns, so I have to use SELECT *.

Is there another approach I could take that would not push my Redshift CPU so high?

The node type is ra3.xlplus with 4 nodes.


Upvotes: 0

Views: 698

Answers (1)

John Rotenstein

Reputation: 269490

It might be worth experimenting to see whether the extracts run faster by creating an external table in Amazon Redshift and then using INSERT (external table) to load data into it.
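
A minimal sketch of that approach, assuming an external schema (called spectrum_schema here) has already been set up with CREATE EXTERNAL SCHEMA, and with a purely illustrative column list:

-- Hypothetical names: spectrum_schema, my_table_snapshot, and the columns
-- are placeholders; they should mirror the columns of my_table.
CREATE EXTERNAL TABLE spectrum_schema.my_table_snapshot (
    id          BIGINT,
    customer_id BIGINT,
    created_at  TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://test_bucket/my_table_snapshot/';

-- Write the current contents of the local Redshift table out to S3 as Parquet.
INSERT INTO spectrum_schema.my_table_snapshot
SELECT * FROM my_table;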

You could either write a query that only INSERTs incremental data by looking at what is already there (very easy if each row has a unique identifier or an increasing timestamp). Otherwise, you would first TRUNCATE the external table and then INSERT into it (but I haven't tried that, so I'm not sure whether TRUNCATE works on an external table).
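
As a rough sketch of the incremental variant, assuming my_table has an increasing timestamp column (created_at is a hypothetical name) and the external table above mirrors its columns:

-- Only append rows newer than what is already in the external (S3) table.
-- COALESCE guards the first run, when the external table is still empty.
INSERT INTO spectrum_schema.my_table_snapshot
SELECT *
FROM my_table
WHERE created_at > (SELECT COALESCE(MAX(created_at), '1970-01-01'::timestamp)
                    FROM spectrum_schema.my_table_snapshot);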

Upvotes: 0
