Reputation: 417
I have a scenario where I need to load data from Amazon S3 into an Amazon Redshift database.
The files in S3 are all JSON which reside under various folders in a bucket. Each folder indicates the date it was generated.
For instance, this is a sample folder structure in S3:
Each of the JSON files does not have a single root element; they have the following format -
file1.json
{
    "Key": "Value",
    "Key1": ["some array"],
    "Key2": "value3"
}
{
    "Key": "Value1",
    "Key1": ["some array1"],
    "Key2": "value2"
}
I want to know if there is a way to load this data recursively (since the JSON files reside in multiple folders) into Redshift using the COPY command,
(OR)
if there is a better method to load the data than the COPY command.
Upvotes: 1
Views: 2082
Reputation: 1560
In S3, there is no such thing as folders. Instead, paths are treated as object keys. As long as your keys follow a common pattern, and assuming that no files other than the ones you want to import match that pattern, for example:
s3
└── b1-bucket
├── f1/20160728/file1.json
├── f1/20160728/file2.json
├── f1/20160729/file1.json
└── f2/20160728/file1.json
Then the following COPY command would match (and copy) all those files:
COPY your_table FROM 's3://b1-bucket/f' CREDENTIALS '' FORMAT AS JSON 'auto';
As stated in the documentation:
The s3://copy_from_s3_objectpath parameter can reference a single file or a set of objects or folders that have the same key prefix.
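For example, if you only need the objects for a single day, you can (assuming the same bucket layout as above) simply extend the prefix to cover just that date; the table name and empty CREDENTIALS string are placeholders here:
COPY your_table FROM 's3://b1-bucket/f1/20160728/' CREDENTIALS '' FORMAT AS JSON 'auto';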
Another option would be to add a manifest file to the COPY statement, which is basically just a file that contains a list of the files you want to import:
{
"entries": [
{"url":"s3://b1-bucket/f1/20160728/file1.json", "mandatory":true},
{"url":"s3://b1-bucket/f1/20160728/file2.json", "mandatory":true},
{"url":"s3://b1-bucket/f1/20160729/file1.json", "mandatory":true},
{"url":"s3://b1-bucket/f2/20160728/file1.json", "mandatory":true}
]
}
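To use it, you would upload the manifest to S3 and reference it in the COPY command with the MANIFEST option (the manifest key below is just a placeholder):
COPY your_table FROM 's3://b1-bucket/load.manifest' CREDENTIALS '' MANIFEST FORMAT AS JSON 'auto';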
See http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
Upvotes: 1