Reputation: 1091
Seemingly cannot get Athena to partition projection to work.
When I add partitions the "old fashioned" way and then run a MSCK REPAIR TABLE testparts;
I can query the data.
I drop the table and recreate with the partition projections below and it fails to query any data at all. The queries that I do get to run take a very very long time with no results, or they time out like below query.
For the sake of argument I followed AWS documentation:
select distinct year from testparts;
I get :
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mydb.testparts' can potentially read more than 1000000 partitions.
I have ~7500 files in there at the moment in the file structures indicated in the table setups below.
I have:
Tried entering the separated parts as date type, provided format "yyyy-MM-dd" and still it did not work (including deleting and changing my s3 structures as well). I then tried to split the dates into different folders and set as integers (which you see below) and still did not work.
Given I can get it to operate "manually" after repairing the table, then successfully querying my structures - I must be doing something wrong at a fundamental level with partition projections.
I have also changed user from injected type to enum (not ideal given it's a plain old string, but did it for the purpose of testing) Table creation :
CREATE EXTERNAL TABLE `testparts`(
`thisdata` array<struct<thistype:string,amount:float,identifiers:map<string,struct<id:string,type:string>>,selections:map<int,array<int>>>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`user` string,
`thisid` int,
`account` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://testoutputs/partitiontests/responses'
TBLPROPERTIES (
'classification'='json',
'projection.day.digits'='2',
'projection.day.range'='1,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.account.range'='1,50',
'projection.account.type'='integer',
'projection.month.digits'='2',
'projection.month.range'='1,12',
'projection.month.type'='integer',
'projection.thisid.range'='1,15',
'projection.thisid.type'='integer',
'projection.user.type'='enum',
'projection.user.values'='usera,userb',
'projection.year.range'='2017,2027',
'projection.year.type'='integer',
'storage.location.template'='s3://testoutputs/partitiontests/responses/year=${year}/month=${month}/day=${day}/user=${user}/thisid=${thisid}/account=${account}/',
'transient_lastDdlTime'='1653445805')
Upvotes: 5
Views: 4245
Reputation: 132862
If you run a query like SELECT * FROM testparts
Athena will generate all permutations of possible values for the partition keys and list the corresponding location on S3. For your table this means doing 11,160,000 listings.
I don't believe that there's any optimization for SELECT DISTINCT year FROM testparts
that would skip building the list of partition key values, so something similar would happen with that query too. Similarly, if you use "Preview table" to run SELECT * FROM testparts LIMIT 10
there is no optimization that skips building the list of partitons or skips listing the locations on S3.
Try running a query that doesn't wildcard any of the partition keys to validate that your config is correct.
Partition projection works differently from adding partitions to the catalog, and some care needs to be taken with wildcards. When partitions are in the catalog non-existent partitions can be eliminated cheaply, but with partition projection S3 has to be listed for every permutation of partition keys after predicates have been applied.
Partition projection works best when there are never wildcards on partition keys, to minimize the number of S3 listings that need to happen.
Upvotes: 5