Mark K Cowan

Reputation: 1863

Redshift: Possibility to specify suffix for paths when doing PARTITIONED UNLOAD to S3?

Is there any way to provide a suffix for paths when doing a partitioned unload to S3?

e.g. I want to use the output of *several* queries as inputs to batch jobs, where the query outputs are partitioned by date.

Currently I have a structure in S3 like:

s3://bucket/path/queryA/key=1/*.parquet
s3://bucket/path/queryA/key=2/*.parquet
s3://bucket/path/queryB/key=1/*.parquet
s3://bucket/path/queryB/key=2/*.parquet

But ideally, I would like to have:

s3://bucket/path/key=1/queryA/*.parquet
s3://bucket/path/key=2/queryA/*.parquet
s3://bucket/path/key=1/queryB/*.parquet
s3://bucket/path/key=2/queryB/*.parquet

I could then use these as input paths for batch-processing jobs (e.g. on SageMaker!):

s3://bucket/path/key=1/
s3://bucket/path/key=2/

Such that each batch job has the output of all queries for the particular day that the batch job is computing for.

Currently, I re-shape the data in S3 after unloading, but it would be much faster and more convenient if I could specify a suffix for Redshift to append to the S3 unload paths, *after* the partition suffix.
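For reference, the re-shaping step amounts to a pure key transform like the one below (a minimal sketch; the helper name and the assumed `.../<query>/<partition>/<file>` layout are illustrative, not anything Redshift provides):

```python
def reshape_key(key: str) -> str:
    """Move the query-name path segment after the partition segment.

    e.g. 'path/queryA/key=1/0001_part_00'
      -> 'path/key=1/queryA/0001_part_00'
    """
    # Assumes a layout of .../<query>/<partition>/<file>
    *prefix, query, partition, filename = key.split("/")
    return "/".join([*prefix, partition, query, filename])
```

Each unloaded object would then be copied to `reshape_key(key)` and the original deleted (e.g. via boto3 or the AWS CLI).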

From the UNLOAD docs, I'm assuming this isn't possible, and I'm unable to post on the AWS forums.

But perhaps there's some other command or a connection variable that I can use, a hack involving something like a literal value for a second partition key, or a totally different strategy altogether?

Upvotes: 3

Views: 2299

Answers (1)

botchniaque

Reputation: 5084

You could add an artificial column q to mark the query, and then use it as a second partition column - that would effectively add a q=queryA component to your path.

BUT, Redshift does not allow you to UNLOAD into a non-empty location unless you provide the ALLOWOVERWRITE option.

Also, since you don't control the unloaded file names (they depend on the slice count and maximum file size), allowing overwrite may cause your data to actually be overwritten if you happen to unload with the same partition keys.

To work around that, you could add one more artificial partition column that adds a unique component to your path (the same value for the whole unload). I used md5(random()) in my example - you could use something more clash-safe, such as a UUID.
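The unique component could also be generated client-side and interpolated into the UNLOAD statement as a literal. A hedged sketch (the function, bucket, and role names are placeholders, not part of any API; it assumes your query already exposes the part and q columns):

```python
import uuid
from typing import Optional

def build_unload(query_sql: str, s3_path: str, iam_role: str,
                 run_id: Optional[str] = None) -> str:
    """Wrap a query in an UNLOAD with a unique literal partition column 'r'.

    run_id defaults to a fresh UUID, so repeated unloads never collide.
    """
    run_id = run_id or uuid.uuid4().hex
    inner = f"SELECT *, '{run_id}' AS r FROM ({query_sql})"
    return (
        f"UNLOAD ($${inner}$$)\n"
        f"TO '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"PARTITION BY (part, q, r)\n"
        f"ALLOWOVERWRITE"
    )
```

Each call produces a statement whose r partition value is unique to that run, so ALLOWOVERWRITE can never clobber a previous unload's files.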

Below is an example query, which unloads data without overwriting results even if unloaded multiple times. I ran it for different part and q values.

UNLOAD ($$
    WITH
    rand(rand) AS (SELECT md5(random())),
    input(val, part) AS (
        SELECT 1, 'p1' UNION ALL
        SELECT 1, 'p2'
    )
    SELECT
        val,
        part,
        'queryB' AS q,
        rand AS r
    FROM input, rand
$$)
TO 's3://XXX/partitioned_unload/'
IAM_ROLE 'XXX'
PARTITION BY (part, q, r)
ALLOWOVERWRITE

These are the files produced by 3 runs:

aws s3 ls s3://XXX/partitioned_unload/ --recursive
2020-06-29 08:29:14          2 partitioned_unload/part=p1/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0001_part_00
2020-06-29 08:28:58          2 partitioned_unload/part=p1/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00
2020-06-29 08:29:54          2 partitioned_unload/part=p1/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:54          2 partitioned_unload/part=p2/q=queryB/r=24a4976a535a584dabdf8861548772d4/0001_part_00
2020-06-29 08:29:14          2 partitioned_unload/part=p3/q=queryA/r=b43e3ff9b6b271387e2ca5424c310bb5/0002_part_00
2020-06-29 08:28:58          2 partitioned_unload/part=p3/q=queryA/r=cfcd208495d565ef66e7dff9f98764da/0001_part_00

Upvotes: 4
