AWS Athena - reduce scan size

Question

How can I reduce the amount of data scanned by a SELECT query in AWS Athena, for example by scanning only one of the columns?

Example: SELECT * FROM TABLE1 WHERE STATUS='Fail';

Zerodf · Accepted Answer

The simplest way to reduce the scan size would be to partition the data by the STATUS value, so that a query filtering on STATUS only reads the matching partition.
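As a rough sketch of what that could look like, the DDL below declares a table partitioned by status. The table name table1_partitioned, the columns id and message, the CSV layout, and the bucket s3://example-bucket/ are all placeholders, not taken from the question; it also assumes the files are already laid out under Hive-style prefixes such as status=Fail/.

```sql
-- Hypothetical table: column names and S3 location are placeholders.
-- Assumes objects are stored under prefixes like
--   s3://example-bucket/table1/status=Fail/...
--   s3://example-bucket/table1/status=Pass/...
CREATE EXTERNAL TABLE table1_partitioned (
  id      string,
  message string
)
PARTITIONED BY (status string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/table1/';

-- Register the Hive-style partition prefixes found under LOCATION.
MSCK REPAIR TABLE table1_partitioned;

-- The filter on the partition column lets Athena skip every other prefix.
SELECT * FROM table1_partitioned WHERE status = 'Fail';
```

With this layout, the WHERE clause on status restricts the scan to the objects under the status=Fail/ prefix rather than the whole table.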

See the Athena user guide for details on partitioning. You may also want to consider a columnar storage format such as Apache Parquet, which Athena supports.

Using a columnar format helps because Athena reads only the columns it needs to satisfy the query. For a SELECT * query this usually won't make much of a difference, but the I/O savings can be substantial if you're only interested in a few columns out of dozens or hundreds. In addition, Parquet and ORC (a competing columnar format also supported by Athena) both support compression, so even when all columns are accessed you still scan considerably less data than with uncompressed CSV or JSON.
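One way to try this without an external conversion job is a CREATE TABLE AS SELECT (CTAS) statement, which writes a Parquet copy of an existing table. This is only a sketch: the target name table1_parquet, the columns id and message, and the S3 location are assumptions for illustration, and the location must be an empty prefix you control.

```sql
-- Hypothetical names: table1_parquet, id, message, and the bucket are placeholders.
CREATE TABLE table1_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://example-bucket/table1_parquet/',
  partitioned_by = ARRAY['status']
) AS
SELECT id, message, status   -- partition columns must come last in the SELECT list
FROM table1;

-- Naming only the columns you need lets Athena skip the rest of each file.
SELECT id FROM table1_parquet WHERE status = 'Fail';
```

This combines both suggestions: the copy is partitioned by status and stored as Parquet, so the second query reads a single partition and a single column.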
