Reputation: 501
I have a large table that I need to unload to S3, partitioning it by year and month, so that the following folder structure is achieved:
bucket_name/folder_name/year/month/
As of now I'm doing this:
unload ('select *, extract(year from question_asked_at) as year, extract(month from question_asked_at) as month from schema_name.table_name')
to 's3://bucket_name/folder_name/'
iam_role '<iam_role>'
partition by (year, month);
The results are:
bucket_name/folder_name/year=2001/month=01/
bucket_name/folder_name/year=2001/month=02/
The partitioning works, but I need to remove the year= and month= prefixes. Any suggestions?
Upvotes: 1
Views: 1486
Reputation: 5084
The partition_column=value format is a convention coined by Hive. Redshift UNLOAD follows that convention (see the Redshift manual for UNLOAD).
I think that to achieve your goal, you'd need to move the files to new prefixes (without year= and month=) as a separate step, using e.g. bash or Python and some regex magic.
I tried to scribble how to do that with boto3, and this is what I came up with:
import boto3
import re

s3 = boto3.resource("s3")

bucket_name = "sbochniak-zalon-eu-central-1"
prefix = "firehose_zalon_backend_events/"

# List all object keys under the unload prefix
keys = [
    o.key
    for o in s3.Bucket(bucket_name).objects.filter(Prefix=prefix).all()
]

# Drop the "year=" and "month=" markers from each key, keeping only the values
new_keys = [
    re.sub(r'^(.*)year=(\w+)(.*)month=(\w+)(.*)$', r'\1\2\3\4\5', k)
    for k in keys
]

# S3 has no real rename, so copy each object to its new key and delete the old one
for old_key, new_key in zip(keys, new_keys):
    s3.Object(bucket_name, new_key).copy_from(CopySource={"Bucket": bucket_name, "Key": old_key})
    s3.Object(bucket_name, old_key).delete()
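As a quick sanity check on the regex, here is what the substitution does to one hypothetical key (the path and file name below are made-up examples, not your actual objects):

import re

key = "folder_name/year=2001/month=01/0000_part_00.parquet"
print(re.sub(r'^(.*)year=(\w+)(.*)month=(\w+)(.*)$', r'\1\2\3\4\5', key))
# folder_name/2001/01/0000_part_00.parquet

One caveat: copy_from corresponds to a single CopyObject call, which only works for objects up to 5 GB; if your unloaded files can be larger than that, you'd want the managed s3.Object(...).copy(...) instead, which handles multipart copies.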
Upvotes: 1