monomonedula
monomonedula

Reputation: 644

Does "limit" reduce the amount of scanned data on AWS Athena?

I have S3 with compressed JSON data partitioned by year/month/day. I was thinking that it might reduce the amount of scanned data if construct query with filtering looking something like this:

...
AND year = 2020
AND month = 10 
AND day >= 1 "
ORDER BY year, month, day DESC
LIMIT 1

Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?

Upvotes: 1

Views: 2853

Answers (1)

Philipp Johannis
Philipp Johannis

Reputation: 2956

Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.

It's also recommended to store the data in a column based format, like Parquet and additionally compress the data. If you store data like that you can optimize queries by just selecting columns you need (there is a difference between select * and select col1,col2,.. in this case).

ORDER BY definitely doesn't limit the data that is scanned, as you need to scan all of the columns in the order by clause to be able to order them. As you have JSON as underlying storage it most likely reads all data.

LIMIT will potentially reduce the amount of data that is read, it depends on the overall size of the data - if limit is way smaller than the overall count of rows it will help.

In general I can recommend to test queries in the Athena interface in AWS - it will tell you the amount of scanned data after a successful execution. I tested on one of my partitioned tables (based on compressed parquet):

  • partition columns in WHERE clause reduces the amount of scanned data
  • LIMIT further reduces the amount of scanned data in some cases
  • ORDER BY leads to reading the all partitions again because it otherwise can't be sorted

Upvotes: 2

Related Questions