Reputation: 644
I have S3 with compressed JSON data partitioned by year/month/day. I was thinking that it might reduce the amount of scanned data if I construct the query with a filter looking something like this:
...
AND year = 2020
AND month = 10
AND day >= 1
ORDER BY year, month, day DESC
LIMIT 1
Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?
Upvotes: 1
Views: 2853
Reputation: 2956
Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.
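For illustration, a partitioned table over such JSON data could be declared roughly like this (a minimal sketch, not from the question: the table name, columns, bucket path, and SerDe are assumptions, and it presumes Hive-style year=/month=/day= prefixes in S3):

-- Hypothetical table over compressed JSON partitioned by year/month/day.
CREATE EXTERNAL TABLE events (
  id      string,
  payload string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/events/';

-- Register the existing year=/month=/day= prefixes as partitions.
MSCK REPAIR TABLE events;

-- With the partition columns in WHERE, Athena prunes to the matching
-- S3 prefixes instead of scanning the whole table.
SELECT id, payload
FROM events
WHERE year = 2020 AND month = 10 AND day >= 1;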
It's also recommended to store the data in a column-based format like Parquet, and additionally to compress the data. If you store data like that, you can optimize queries by selecting only the columns you need (there is a difference between select * and select col1, col2, ... in this case).
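As a sketch of that conversion, an Athena CTAS statement can rewrite the JSON table above as partitioned, compressed Parquet (the target table name, location, and selected columns are again assumptions):

-- Hypothetical CTAS: rewrite the JSON table as Snappy-compressed Parquet,
-- keeping the same year/month/day partitioning. Partition columns must
-- come last in the SELECT list.
CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/events-parquet/',
  partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT id, payload, year, month, day
FROM events;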
ORDER BY definitely doesn't limit the data that is scanned, as you need to read all of the data to be able to order it by the columns in the ORDER BY clause. As you have JSON as the underlying storage, it most likely reads all of the data.
LIMIT will potentially reduce the amount of data that is read; it depends on the overall size of the data - if the limit is much smaller than the overall row count, it will help.
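If the ORDER BY ... LIMIT 1 in the question is only meant to fetch a row from the newest day, one cheaper pattern (sketched against the hypothetical events table above) is to pin all three partition columns, which makes the sort over partition columns unnecessary:

-- All three partition columns are fixed, so Athena reads only one
-- S3 prefix and no longer has to sort the whole table.
SELECT id, payload
FROM events
WHERE year = 2020 AND month = 10 AND day = 31
LIMIT 1;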
In general I recommend testing queries in the Athena interface in AWS - it shows the amount of scanned data after a successful execution. I tested on one of my partitioned tables (based on compressed Parquet):
- partition columns in the WHERE clause reduce the amount of scanned data
- LIMIT further reduces the amount of scanned data in some cases
- ORDER BY leads to reading all the partitions again, because the data otherwise can't be sorted

Upvotes: 2