Reputation: 39
I have a large number of CSV files (~12k), each of which is small (~250 records). I want to load them into a 3-node Redshift cluster in the same region, but it's taking a really long time.
The query I used in SQL Workbench/J is:
copy gsod from 's3://[path to folder]' access_key_id '******' secret_access_key '******' delimiter ',' BLANKSASNULL emptyasnull IGNOREHEADER 1 maxerror as 100000;
The query runs in seconds if I use a single file, but what's the best way to load all of them as quickly as possible?
I have already tried loading the files from an S3 bucket in the same region as the cluster, using the same COPY command as above.
Upvotes: 0
Views: 1597
Reputation: 5729
Go for the manifest file option. It will be much faster.
https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}
The copy runs in parallel: instead of processing the files one by one, Redshift loads all of the listed files in a single attempt.
copy customer from 's3://mybucket/your-manifest-file' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;
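With ~12k files you won't want to write the manifest by hand. Here is a minimal sketch of generating and uploading it with Python and boto3; the bucket name "my-bucket", the prefix "gsod/", and the manifest key are placeholders you would replace with your own paths:

import json
import boto3

# List every object under the prefix and build one manifest entry per CSV file.
# "my-bucket" and "gsod/" are placeholders for your own bucket and folder.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

entries = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="gsod/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".csv"):
            entries.append({"url": f"s3://my-bucket/{obj['Key']}", "mandatory": True})

# Write the manifest locally, then upload it so COPY can read it from S3.
with open("gsod.manifest", "w") as f:
    json.dump({"entries": entries}, f, indent=2)

s3.upload_file("gsod.manifest", "my-bucket", "gsod.manifest")

Then point COPY at the uploaded manifest as in the example above. The manifest option works with your existing access_key_id/secret_access_key credentials as well as with iam_role.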
Hope this helps.
Upvotes: 2