AWS Redshift: Load data from many buckets on S3

Question

I am trying to load data from two different buckets on S3 to Redshift table. In each bucket, there are directories with dates in their names and each of directories contains many files, but there are not manifest.

Example S3 structure:

# Bucket 1
s3://bucket1/20170201/part-01
s3://bucket1/20170201/part-02

s3://bucket1/20170202/part-01

s3://bucket1/20170203/part-00
s3://bucket1/20170203/part-01

# Bucket 2
s3://bucket2/20170201/part-00

s3://bucket2/20170202/part-00
s3://bucket2/20170202/part-01

s3://bucket2/20170203/part-00

Let's say that data from both buckets for 20170201 and 20170202 should be loaded. One of the solutions can be running 4 times COPY command - ones per each bucket-date pair. But I'm curious if it could be done within single COPY call. I've seen that manifest file allows specifying few different files (including from different buckets). However:

is there option to use prefix instead full path in the manifest,
and can I use somehow manifest in SQL passing it as a string instead file location - I want to avoid creating temporary files on S3?

John Rotenstein · Accepted Answer

You can use a manifest file to specify different buckets, paths and files.

The Using a Manifest to Specify Data Files documentation shows an example:

{
  "entries": [
    {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
    {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
  ]
}

The documentation also says:

The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix.

The intent of using a manifest file is to know exactly which files have been loaded into Amazon Redshift. This is particularly useful when loading files that become available on a regular basis. For example, if files appear every 5 minutes and a COPY command was run to load the data from a given prefix, then it is unclear which files have already been loaded. This leads to potentially double-loading files.

The remedy is to use a manifest file that clearly specifies exactly which files to load. This obviously needs some code to find the files, create the manifest file and then trigger the COPY command.

It is not possible to load content from different buckets/paths without using a manifest file.

AWS Redshift: Load data from many buckets on S3

Answers (1)

Related Questions