Athena query is very slow

Question

I am storing 400,000 parquet files in S3 that are partitioned based on a unique id (e.g. 412812). The files range in size from 25kb to 250kb of data. I then want to query the data using Athena. Like so,

Select * 
From Table 
where id in (412812, 412813, 412814)

This query is much slower than anticipated. I want to be able to search for any set of ids and get a fast response. I believe it is slow is because Athena must search through the entire glue catalog looking for the right file (i.e., a full scan of files).

The following query is extremely fast. Less than a second.

Select * 
From Table 
where id = 412812

partition.filtering is enabled on the table. I tried adding an index to the table that was the same as the partition, but it did not speed anything up.

Is there something wrong with my approach or a table configuration that would make this process faster?

John Rotenstein · Accepted Answer

Your basic problem is that you have too many files and too many partitions.

While Amazon Athena does operate in parallel, there are limits to how many files it can process simultaneously. Plus, each extra file adds overhead for listing, opening, etc.

Also, putting just a single file in each partition greatly adds to the overhead of handling so many partitions and is probably counterproductive for increasing the efficiency of the system.

I have no idea of how you actually use your data, but based on your description I would recommend that you create a new table that is bucketed_by the id, rather than partitioned:

CREATE TABLE new_table
WITH (
 format = 'PARQUET',
 parquet_compression = 'SNAPPY',
 external_location = 's3://bucket/new_location/',
 bucketed_by = ARRAY['id']
)
AS SELECT * FROM existing_table

Let Athena create as many files as it likes -- it will optimize based upon the amount of data. More importantly, it will create larger files that allow it to operate more efficiently.

See: Bucketing vs Partitioning - Amazon Athena

In general, partitions are great when you can divide the data into some major subsets (eg by country, state or something that represents a sizeable chunk of your data), while bucketing is better for fields that have values that are relatively uncommon (eg user IDs). Bucketing will create multiple files and Athena will be smart enough to know which files contain the IDs you want. However, it will not be partitioned into subdirectories based upon those values.

Creating this new table will greatly reduce the number of files that Amazon Athena would need to process for each query, which will make your queries run a LOT faster.

Athena query is very slow

Answers (1)

Related Questions