piyush daga

Reputation: 501

Unload data into redshift in a specific partition format

I have a large table that I need to unload to S3, partitioning it by year and month, such that the following folder structure is achieved.

bucket_name/folder_name/year/month/

As of now I'm doing this:

unload ('select *, extract(year from question_asked_at) as year, extract(month from question_asked_at) as month from schema_name.table_name')
to 's3://bucket_name/folder_name/'
iam_role '<iam_role>'
partition by (year, month);

The results are: bucket_name/folder_name/year=2001/month=01/, bucket_name/folder_name/year=2001/month=02/

The partitioning works, but I need to remove the year= and month= prefixes from the folder names. Any suggestions?

Upvotes: 1

Views: 1486

Answers (1)

botchniaque

Reputation: 5084

The format partition_column=value is a convention coined by Hive, and Redshift UNLOAD follows that convention (see the Redshift manual for UNLOAD).

I think that to achieve your goal, you'd need to move the files to new prefixes (without year= and month=) as a separate step, using e.g. bash or Python and some regex magic.


I have tried to scribble how to do that with boto3, and this is what I came up with:

import boto3
import re

s3 = boto3.resource("s3")

bucket_name = "sbochniak-zalon-eu-central-1"
prefix = "firehose_zalon_backend_events/"

# Collect every object key under the prefix
keys = [
    o.key
    for o in s3.Bucket(bucket_name).objects.filter(Prefix=prefix).all()
]

# Strip the Hive-style "year=" and "month=" markers while keeping the values
new_keys = [
    re.sub(r'^(.*)year=(\w+)(.*)month=(\w+)(.*)$', r'\1\2\3\4\5', k)
    for k in keys
]

# Copy each object to its rewritten key, then delete the original
for old_key, new_key in zip(keys, new_keys):
    s3.Object(bucket_name, new_key).copy_from(CopySource={"Bucket": bucket_name, "Key": old_key})
    s3.Object(bucket_name, old_key).delete()
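
If you'd rather do the renaming from the shell, as mentioned above, a minimal bash sketch using the AWS CLI could look roughly like this. The bucket and prefix names here are placeholders for your own unload location, and it assumes the keys contain no spaces:

#!/usr/bin/env bash
# Placeholder bucket/prefix; adjust to match your unload destination.
bucket="bucket_name"
prefix="folder_name/"

# List all keys under the prefix (the key is the 4th column of `aws s3 ls --recursive`),
# drop the "year=" and "month=" markers, and move each object to the rewritten key.
aws s3 ls "s3://${bucket}/${prefix}" --recursive | awk '{print $4}' | while read -r old_key; do
    new_key=$(echo "${old_key}" | sed -e 's/year=//' -e 's/month=//')
    if [ "${old_key}" != "${new_key}" ]; then
        aws s3 mv "s3://${bucket}/${old_key}" "s3://${bucket}/${new_key}"
    fi
done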

Upvotes: 1
