Suhaib Ahmed

Reputation: 39

How do I load a large number of small CSV files from S3 to Redshift?

I have a large number of CSV files (~12k), each of which is small (~250 records). I want to load them into a Redshift cluster of size 3 in the same region, but it's taking a really long time.

The query I used in SQL Workbench/J is:

copy gsod from 's3://[path to folder]' access_key_id '******' secret_access_key '******' delimiter ',' BLANKSASNULL emptyasnull IGNOREHEADER 1 maxerror as 100000;

The query works in seconds if I use a single file. But what's the best way to load all of them as quickly as possible?

I have already tried loading the files from an S3 bucket in the same region as the cluster, using the same COPY command as above.

Upvotes: 0

Views: 1597

Answers (1)

Red Boy

Reputation: 5729

Go for the manifest file option. It will be really fast.

https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html

  {
    "entries": [
      {"url": "s3://mybucket-alpha/2013-10-04-custdata", "mandatory": true},
      {"url": "s3://mybucket-alpha/2013-10-05-custdata", "mandatory": true},
      {"url": "s3://mybucket-beta/2013-10-04-custdata", "mandatory": true},
      {"url": "s3://mybucket-beta/2013-10-05-custdata", "mandatory": true}
    ]
  }

This does the copy in parallel: instead of processing files one by one, Redshift loads all the files listed in the manifest in a single attempt.

copy customer from 's3://mybucket/your-manifest-file' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;
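With ~12k files, you'd want to generate the manifest programmatically rather than by hand. A minimal sketch (the bucket name and key prefix are placeholders; in practice you'd collect the keys with boto3's `list_objects_v2` paginator and upload the resulting JSON to S3 before running COPY):

```python
import json

def build_manifest(bucket, keys, mandatory=True):
    """Build a Redshift COPY manifest dict from a list of S3 object keys."""
    return {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": mandatory}
            for key in keys
        ]
    }

# Hypothetical keys; in practice, list them with
# boto3.client("s3").get_paginator("list_objects_v2") over your prefix.
keys = ["gsod/file-0001.csv", "gsod/file-0002.csv"]
manifest = build_manifest("mybucket-alpha", keys)
print(json.dumps(manifest, indent=2))
```

Write the output to a file, upload it to S3, and point the COPY command's FROM clause at that manifest object with the `manifest` keyword, as in the command above.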

Hope this helps.

Upvotes: 2

Related Questions