Viet Hoang

Reputation: 13

Is there any option that performs better for exporting data locally than Redshift UNLOAD via S3?

I'm working on a Spring project that needs to export Redshift table data into a single local CSV file. The current approach is to:

  1. Execute a Redshift UNLOAD via JDBC, which writes the data across multiple files in S3
  2. Download those files from S3 to local storage
  3. Join them together into one single CSV file (a rough sketch of steps 2 and 3 follows the UNLOAD statement below)
UNLOAD (
  'SELECT DISTINCT #{#TYPE_ID} 
  FROM target_audience 
  WHERE #{#TYPE_ID} is not null 
  AND #{#TYPE_ID} != \'\' 
  GROUP BY #{#TYPE_ID}'
) 
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_' 
credentials 'aws_access_key_id=#{#accesskey};aws_secret_access_key=#{#secretkey}' 
DELIMITER AS ',' ESCAPE GZIP ;
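
For reference, steps 2 and 3 boil down to something like the following sketch, assuming the AWS SDK for Java v2; the class name, bucket, prefix and target path are illustrative placeholders rather than the project's actual code:

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Illustrative sketch of steps 2 and 3: list the UNLOAD parts, stream each one
// down from S3, decompress it (UNLOAD ... GZIP writes gzip-compressed parts)
// and append it to a single local CSV file.
public class UnloadPartMerger {

    public static void merge(S3Client s3, String bucket, String prefix, Path targetCsv) throws Exception {
        // Note: for more than 1000 parts you would use listObjectsV2Paginator instead.
        ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
                .bucket(bucket)
                .prefix(prefix)
                .build();
        try (OutputStream out = Files.newOutputStream(targetCsv)) {
            for (S3Object part : s3.listObjectsV2(listRequest).contents()) {
                GetObjectRequest getRequest = GetObjectRequest.builder()
                        .bucket(bucket)
                        .key(part.key())
                        .build();
                try (InputStream in = new GZIPInputStream(s3.getObject(getRequest))) {
                    in.transferTo(out);   // append this part's rows to the merged CSV
                }
            }
        }
    }
}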

The above approach has been working fine, but I think the overall performance can be improved, for example by skipping the S3 part and getting the data directly from Redshift to local storage.

After searching through online resources, I found that you can export data from Redshift directly through psql, or run the SELECT queries and move the result data myself. But neither option can match Redshift UNLOAD's performance with parallel writing.

So is there any way I can mimic UNLOAD's parallel writing to achieve the same performance without having to go through S3?

Upvotes: 1

Views: 2539

Answers (2)

John Rotenstein

Reputation: 270124

You can avoid the need to join files together by using UNLOAD with the PARALLEL OFF parameter. It will output only one file.
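
For example, the UNLOAD from the question would only need one extra parameter at the end (query and credentials abbreviated here):

UNLOAD ('SELECT DISTINCT ...')
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_'
credentials 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER AS ',' ESCAPE GZIP
PARALLEL OFF;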

It will, however, still create multiple files if the file size exceeds 6.2 GB.

See: UNLOAD - Amazon Redshift

It is doubtful that you would get better performance by running psql, but if performance is important for you then you can certainly test the various methods.

Upvotes: 0

Red Boy

Reputation: 5739

We do exactly the same thing you're trying to do here. In our performance comparison it turned out to be almost the same, and in some cases even better, for our use case. It is also easier to program and debug, since there is practically only one step.

# replace user, password, host, region and dbname appropriately in the command below
psql "postgresql://user:password@host.region.redshift.amazonaws.com:5439/dbname?sslmode=require" -c "select C1,C2 from sch1.tab1" > ABC.csv

This enables us to avoid three steps:

  1. Unload using JDBC
  2. Download the exported data from S3
  3. Decompress the gzip files (we used GZIP to save network input/output)

On the other hand, it also saves some cost (S3 storage, though that is negligible). By the way, from PostgreSQL 9.0 onwards, sslcompression is on by default.

Upvotes: 0
