tyvich
tyvich

Reputation: 580

Athena Partition Projection for Date column vs. String

I'm looking to use Athena Partition Projection to analyze log files from AWS application load balancers and firehose emitted logs. The data in S3 is prefixed with year/month/day and potentially hour as well. I've been able to accomplish using the Firehose Example; however this example uses a string formatted partition column.

I'm looking to see if it's possible to use a date formatted partition column instead (with partition project and the firehose emitted s3 prefix format), as our query writers are already used to most of our queries involving date columns and it avoids the need to string format for relative date queries. Is this possible or would the s3 prefixes need to be changed to accomplish?

Table Properties for String column: WORKS

PARTITIONED BY ( 
  `logdate` string)
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.logdate.format'='yyyy/MM/dd', 
  'projection.logdate.interval'='1', 
  'projection.logdate.interval.unit'='DAYS', 
  'projection.logdate.range'='NOW-2YEARS,NOW', 
  'projection.logdate.type'='date',
  'storage.location.template'='s3://bucket/prefix/${logdate}')

Table Properties for Date Partition column Does Not Work

PARTITIONED BY ( 
  `logdate` date)
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.logdate.format'='yyyy/MM/dd', 
  'projection.logdate.interval'='1', 
  'projection.logdate.interval.unit'='DAYS', 
  'projection.logdate.range'='NOW-2YEARS,NOW', 
  'projection.logdate.type'='date',
  'storage.location.template'='s3://bucket/prefix/${logdate}')

HIVE_INVALID_PARTITION_VALUE: Invalid partition value '2018/11/13' for DATE partition key: logdate=2018%2F11%2F13

Upvotes: 7

Views: 6258

Answers (1)

Theo
Theo

Reputation: 132882

I think the only thing you need to do is make sure the type of the logdate partition key to be string:

PARTITIONED BY (logdate string)

This is not the same as projection.logdate.type, which should continue to be date.

Partition keys with type date are just dates within the calculations partition projection performs. For all other purposes they are strings. PP will parse values using the date format you specify, do its calculations, then output strings using the same date format. This happens during query planning, before the Presto engine is involved.

Presto's schema-on-read approach means that you can say that a column has type date if its format matches the expected format of dates; yyyy-MM-dd in Java format. The format that you get from Firehose's S3 keys, yyyy/MM/dd, can't be cast to date automatically, it needs to be explicitly parsed:

parse_datetime(logdate, 'yyyy/MM/dd')

I think it would have been great if PP would have been aware of the types of partition keys so that you could have done what you have tried to do, but I assume that since PP happens during query planning and most likely not anywhere near where the types of things are known it's probably too difficult to achieve.

Upvotes: 10

Related Questions