How can I speed up this Athena Query?

Question

I am running a query through the Athena Query Editor on a table in the Glue Data Catalog and would like to understand why it takes so long to do a simple select * from this data.

Our data is stored in an S3 bucket that is partitioned by year/month/day/hour, with 80 snappy Parquet files per partition that are anywhere between 1 - 10 MB in size each. When I run the following query:

select stringA, stringB, timestampA, timestampB, bigintA, bigintB
from tableA
where year='2021' and month='2' and day = '2'

It scans 700MB but takes over 3 minutes to display the Athena results. I feel that we have already optimized the file format and partitioning for this data, and so I am unsure how else we can improve the performance if we're just trying to select this data out and display it in a tool like QuickSight.

giantbrioche · Accepted Answer

The select * performance was impacted by the number of files that needed to be scanned, which were all relatively small. Repartitioning and removing the hour partition resulted in an improvement in both runtime (14% reduction) and also data scanned (26% reduction) due to snappy compression getting more gains on larger files.

Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

How can I speed up this Athena Query?

Answers (1)

Related Questions