Dipayan

Reputation: 213

Hive partitioning for data on S3

Our data is stored using s3://bucket/YYYY/MM/DD/HH, and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data using AWS Athena just fine; however, we have a Hive query cluster that has trouble querying the data when partitioning is enabled.

This is what I am doing: PARTITIONED BY (`year` string, `month` string, `day` string, `hour` string)

This doesn't seem to work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH,

however it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.

Given the rigid bucket paths Firehose produces, I cannot modify the S3 paths. So my question is: what is the right partitioning scheme in the Hive DDL when the data path doesn't carry explicitly named columns like year= or month=?
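
For reference, a minimal sketch of the kind of DDL in question; the table name, columns, and storage clause below are placeholders rather than the actual schema:

-- Hypothetical external table over the Firehose output; only the
-- PARTITIONED BY clause comes from the question, the rest is assumed.
CREATE EXTERNAL TABLE events (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING, `hour` STRING)
STORED AS PARQUET
LOCATION 's3://bucket/';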

Upvotes: 2

Views: 7596

Answers (2)

Andy Yi Cui

Reputation: 361

You can now specify a custom S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html

myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
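
Once the prefixes follow the Hive key=value convention, partitions can be discovered automatically instead of being added one by one; a sketch, assuming a table named events:

-- Scans the table's S3 location and registers any year=/month=/day=/hour=
-- directories not yet present in the metastore.
MSCK REPAIR TABLE events;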

Upvotes: 3

Sai Dileep

Reputation: 179

If you can't get folder names that follow the Hive naming convention, you will need to map each partition to its location manually:

ALTER TABLE tableName ADD PARTITION (year='YYYY', month='MM', day='DD', hour='HH') LOCATION 's3://bucket/YYYY/MM/DD/HH/';
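
Hive also accepts several PARTITION clauses in a single statement, which helps when backfilling many hours; the values below are illustrative:

-- Backfill two hourly partitions in one ALTER; IF NOT EXISTS makes
-- the statement safe to re-run.
ALTER TABLE tableName ADD IF NOT EXISTS
  PARTITION (year='2018', month='01', day='01', hour='00') LOCATION 's3://bucket/2018/01/01/00/'
  PARTITION (year='2018', month='01', day='01', hour='01') LOCATION 's3://bucket/2018/01/01/01/';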

Upvotes: 2
