Reputation: 10631
The goal is to unload a few tables (one per customer) to S3 in Parquet format every few hours.
Each table is around 1 GB in CSV format; in Parquet it is around 120 MB.
The issue is that when running 2-3 parallel UNLOAD commands, the CPU of the Redshift nodes in the cluster goes to 98%-100%.
This is the command I use:
unload ('select * from my_table')
to 's3://test_bucket'
iam_role 'my_arn'
allowoverwrite
format PARQUET;
I also tried to unload the data as CSV, but the nodes also went to around 98%-100% CPU.
I would like to save a snapshot of the whole table (a few tables) every few hours, and I must have all the columns, so I have to use select *.
Is there another approach I might take that would not raise my Redshift CPU so high?
The node type is ra3.xlplus with 4 nodes.
Upvotes: 0
Views: 698
Reputation: 269490
It might be worth experimenting to see whether extracts run faster by creating an External Table in Amazon Redshift and then using INSERT (external table) - Amazon Redshift to load data into it.
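As a rough sketch of the setup (I haven't benchmarked this; spectrum_schema, spectrum_db, my_table_ext, the column list and the S3 prefix are placeholder names you would replace with your own, and the columns must match my_table):

create external schema if not exists spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'my_arn'
create external database if not exists;

create external table spectrum_schema.my_table_ext (
  id bigint,
  updated_at timestamp
  -- ... remaining columns of my_table
)
stored as parquet
location 's3://test_bucket/my_table/';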
You could either write a query that just INSERTs incremental data by looking at what is already there (very easy if there is a unique identifier in each row, or an increasing timestamp). Otherwise, you would first TRUNCATE the external table and then INSERT into it (but I haven't tried that, so I'm not sure whether TRUNCATE would work on an external table).
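The incremental variant might look something like the following, assuming my_table has an increasing updated_at timestamp and that your cluster supports writing Parquet via INSERT (external table), which I haven't verified:

-- append only the rows newer than what is already in S3;
-- note: if the external table is empty, max() is null and nothing is
-- inserted, so seed it with a full insert first
insert into spectrum_schema.my_table_ext
select *
from my_table
where updated_at > (select max(updated_at) from spectrum_schema.my_table_ext);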
Upvotes: 0