Reputation: 213
Our data is stored using s3://bucket/YYYY/MM/DD/HH
and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data using AWS Athena just fine, but we have a Hive query cluster that runs into trouble when partitioning is enabled.
This is what I am doing:
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
This doesn't seem to work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH,
however it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.
Given the rigid bucket paths that Firehose writes, I cannot modify the S3 paths. So my question is: what is the right partitioning scheme in the Hive DDL when the data path doesn't carry explicitly named columns like year= or month=?
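For context, the partition clause above sits in an external table definition roughly like this (a minimal sketch; the table name and data columns are placeholders, not our real schema):

CREATE EXTERNAL TABLE events (
  event_id string,
  payload string)
PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string,
  `hour` string)
STORED AS PARQUET
LOCATION 's3://bucket/';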
Upvotes: 2
Views: 7596
Reputation: 361
Now you can specify a custom S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
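With that prefix the objects land under Hive-style year=/month=/day=/hour= folders, so Hive can pick the partitions up directly. A minimal sketch, assuming the table keeps the four partition columns from the question (tableName is a placeholder):

-- Discover and register the Hive-style partitions Firehose has written
MSCK REPAIR TABLE tableName;

-- Then query by partition as usual (example values)
SELECT count(*) FROM tableName WHERE year='2019' AND month='05';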
Upvotes: 3
Reputation: 179
If you can't get folder names that follow the Hive naming convention, you will need to map each partition manually:
ALTER TABLE tableName ADD PARTITION (year='YYYY', month='MM', day='DD', hour='HH') LOCATION 's3://bucket/YYYY/MM/DD/HH/';
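Since every hour needs its own partition, it helps to batch several PARTITION clauses into one statement, typically generated by a scheduled script. A sketch with example date values (tableName and the bucket are placeholders):

ALTER TABLE tableName ADD IF NOT EXISTS
  PARTITION (year='2019', month='05', day='01', hour='00') LOCATION 's3://bucket/2019/05/01/00/'
  PARTITION (year='2019', month='05', day='01', hour='01') LOCATION 's3://bucket/2019/05/01/01/';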
Upvotes: 2