Sedhu
Sedhu

Reputation: 773

How to select data from aws athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

Athena Tables are partitioned like and same as s3 folder path

parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY ( 
  `parent` string, 
  `year` int, 
  `month` tinyint, 
  `date` tinyint)

Now how should I form the where condition for a select query to get data for parent = "9ab4fcca-65d8-11ea-bc55-0242ac130003" from 2019-06-01 to 2020-04-31 ?

SELECT *  
 FROM table 
 WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;

But this isn't correct. Please help

Upvotes: 1

Views: 6442

Answers (2)

Theo
Theo

Reputation: 132862

Partitioning on year, month, and day separately makes it unnecessarily difficult to query tables. If you're starting out I really suggest to avoid this kind of partitioning scheme. If you can't avoid it you can still make things easier by creating the table partitions differently.

Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load partitions. This works, but it's far from the best way to use Athena. MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.

I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, you can actually have any structure you want, 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other structure where there is one distinct prefix per date will work more or less equally well.

I also suggest you create tables with only one partition column: date (or dt or day if you want avoid quoting), typed as DATE, not string.

What you do then, instead of running MSCK REPAIR TABLE is that you use ALTER TABLE … ADD PARTITION or the Glue APIs directly, to add partitions. This command lets you specify the location separately from the partition column value:

ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'

The important thing here is that the partition column value doesn't have to have any relationship at all with the location, this would work equally well:

ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'

For your specific case you could have:

PARTITIONED BY (`parent` string, `day` date)

and then do:

ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'

Upvotes: 9

Ilya Kisil
Ilya Kisil

Reputation: 2658

Here is how you can use year, month and day values the come from partitions in order to select date range

SELECT col1, col2
FROM my_table 
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)),'-',
                             CAST(month AS VARCHAR(2)),'-',
                             CAST(day AS VARCHAR(2))
                             ), '%Y-%m-%d') as DATE) 
BETWEEN DATE '2019-06-01' AND DATE '2020-04-31'

You can add additional filter statements as needed)

Upvotes: 7

Related Questions