ash_m
ash_m

Reputation: 51

AWS Athena: Partition projection using date-hour with mixed ranges

I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Firehouse, grouped using a dynamic partitioning key. For example, the records look like the following:

period item_id
2022/05 monthly_item_1
2022/05/04 daily_item_1
2022/05/04/02 hourly_item_1
2022/06 monthly_item_2

I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period would be in a supported Java date format. Therefore, I am writing these records to S3 in the below format:

s3://bucket/prefix/2022/05/monthly_items.gz

s3://bucket/prefix/2022/05/04/daily_items.gz

s3://bucket/prefix/2022/05/04/02/hourly_items.gz

s3://bucket/prefix/2022/06/monthly_items.gz

I want to run Athena queries for every partition scope i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items. If I am running a query for a month, I want to its fetch monthly, daily as well as hourly items.

I've created an Athena table using below query:

create external table `my_table`(
  `period` string COMMENT 'from deserializer',
  `item_id` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket/prefix/'
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.day.type'='integer',
  'projection.day.digits' = '2',
  'projection.day.range'='01,31', 
  'projection.hour.type'='integer',
  'projection.hour.digits' = '2',
  'projection.hour.range'='00,23', 
  'projection.month.type'='integer', 
  'projection.month.digits'='02', 
  'projection.month.range'='01,12',
  'projection.year.format'='yyyy', 
  'projection.year.range'='2022,NOW',  
  'projection.year.type'='date', 
  'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')

However, with this table running below query outputs zero results:

select * from my_table where year = '2022' and month = '06';

I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.

I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.

Would be great if anyone can help me out!

Upvotes: 1

Views: 1640

Answers (1)

Patrick W.
Patrick W.

Reputation: 149

make 'projection.year.type'='integer' instead of date.

Upvotes: 1

Related Questions