FirstName

Reputation: 417

Loading data from S3 to Redshift using the COPY command

I have a scenario where I need to load data from Amazon S3 into an Amazon Redshift database.

The files in S3 are all JSON and reside under various folders in a bucket. Each folder indicates the date it was generated.

For instance, here is a sample folder structure in S3:

  1. Bucket (B1)
    A. Folder (F1)
      i. file1.json, file2.json ... and so on
    B. Folder (F2)
      ii. file22.json, file23.json ... and so on

None of the JSON files has a single root element. Instead, each file contains a sequence of objects in the following format -

file1.json

{
Key : Value,
Key1 : [ some  array],
Key2 : value3,
}

{
Key : Value1,
Key1 : [ some  array1],
Key2 : value2,
}
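For reference, a file in this "concatenated objects" layout (no wrapping array) can be split into individual objects with nothing beyond the standard library, e.g. using `json.JSONDecoder.raw_decode`:

```python
import json

def iter_json_objects(text: str):
    """Yield each top-level JSON object from a file with no single root element."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip whitespace between consecutive objects.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        # raw_decode parses one object and reports where it ended.
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

sample = '{"Key": "Value"}\n{"Key": "Value1"}'
print(list(iter_json_objects(sample)))
```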

I want to know if there is a way to load this data recursively (as the JSON files reside in multiple folders) into Redshift using the COPY command,

(OR)

whether there is a better method than the COPY command to load the data.

Upvotes: 1

Views: 2082

Answers (1)

moertel

Reputation: 1560

Option 1: Key Prefix Matching

In S3, there is no such thing as folders. Instead, paths are part of the object keys. As long as your keys follow a common pattern, and assuming that no files other than the ones you want to import match that pattern, for example:

s3
└── b1-bucket
    ├── f1/20160728/file1.json
    ├── f1/20160728/file2.json
    ├── f1/20160729/file1.json
    └── f2/20160728/file1.json

Then the following COPY command would match (and copy) all those files:

COPY your_table FROM 's3://b1-bucket/f' CREDENTIALS '' FORMAT AS JSON 'auto';
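If you are issuing this from code, a minimal sketch of building the statement (the table name, bucket, prefix, and IAM role ARN below are placeholders, and using `IAM_ROLE` instead of `CREDENTIALS` is an assumption about your auth setup; in practice you would run the result through a Redshift connection, e.g. via psycopg2):

```python
def build_copy_sql(table: str, bucket: str, key_prefix: str, iam_role: str) -> str:
    """Build a COPY statement that loads every S3 object sharing a key prefix."""
    # A prefix of 'f' matches f1/..., f2/..., and so on.
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS JSON 'auto';"
    )

sql = build_copy_sql(
    "your_table", "b1-bucket", "f",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",  # hypothetical role ARN
)
print(sql)
```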

As stated in the documentation:

The s3://copy_from_s3_objectpath parameter can reference a single file or a set of objects or folders that have the same key prefix.

Option 2: Manifest File

Another option would be to add a manifest file to the COPY statement. A manifest is simply a JSON file containing a list of the files you want to import:

{
  "entries": [
    {"url":"s3://b1-bucket/f1/20160728/file1.json", "mandatory":true},
    {"url":"s3://b1-bucket/f1/20160728/file2.json", "mandatory":true},
    {"url":"s3://b1-bucket/f1/20160729/file1.json", "mandatory":true},
    {"url":"s3://b1-bucket/f2/20160728/file1.json", "mandatory":true}
  ]
}
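The manifest can also be generated programmatically. A minimal sketch (the key list is hard-coded for illustration; in practice you would collect the keys yourself, e.g. by listing the bucket with boto3, which is an assumption about your setup):

```python
import json

def build_manifest(bucket: str, keys: list) -> str:
    """Return a Redshift COPY manifest listing the given S3 object keys."""
    entries = [
        # mandatory=True makes the COPY fail if the file is missing.
        {"url": f"s3://{bucket}/{key}", "mandatory": True}
        for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)

manifest = build_manifest(
    "b1-bucket",
    [
        "f1/20160728/file1.json",
        "f1/20160728/file2.json",
        "f2/20160728/file1.json",
    ],
)
print(manifest)
```

You would then upload the manifest to S3 and point COPY at it with the MANIFEST option, e.g. `COPY your_table FROM 's3://b1-bucket/load.manifest' ... MANIFEST;` (the manifest key name here is hypothetical).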

See http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html

Upvotes: 1
