AWS Athena, Parquet and predicate pushdown

Question

I am trying to understand the relation between partitioning and predicate pushdown. Imagine we're using AWS Athena to work with data stored in S3 in the Parquet format. The dataset is partitioned by the date and type columns. It also contains a subtype column, which is always NULL if type is equal to X.

I'm trying to execute the query below

SELECT date, type, subtype FROM tablename WHERE date = '01012001' AND subtype IS NOT NULL;

The query scans the same amount of data as

SELECT date, type, subtype FROM tablename WHERE date = '01012001' AND type <> X AND subtype IS NOT NULL;

However, the latter query is significantly faster (ca.30%)

I am trying to understand what happens under the hood. My guess is that in the second case the executor scans only the partitions which are of given type and date whereas in the first one, it attempts to scan all files satisfying the date requirement regardless of their type. That requires more time even though, as the records with NULL subtype are skipped, the actual query cost is the same.

If someone could verify my guess and, if it's wrong, correct that would be most appreciated.

Piotr Findeisen · Accepted Answer

You mentioned the data is partitioned on date and type.

In WHERE date = '01012001' AND type <> X AND subtype IS NOT NULL kind of query, Presto (Athena) will need to access only the relevant files.

In WHERE date = '01012001' AND subtype IS NOT NULL kind of query, Presto will access all files for given date (for all type values). Thanks to the information stored in Parquet file's metadata, a file can be skipped as not containing non-NULL values of subtype. However, fetching the metadata still takes time, as there is a latency involved when accessing the storage (S3).

AWS Athena, Parquet and predicate pushdown

Answers (1)

Related Questions