Reputation: 682
I am trying to understand the relation between partitioning and predicate pushdown. Imagine we're using AWS Athena to work with data stored in S3 in the Parquet format. The dataset is partitioned by the date
and type
columns. It also contains a subtype
column, which is always NULL if type
is equal to X.
I'm trying to execute the query below
SELECT date, type, subtype FROM tablename WHERE date = '01012001' AND subtype IS NOT NULL;
The query scans the same amount of data as
SELECT date, type, subtype FROM tablename WHERE date = '01012001' AND type <> X AND subtype IS NOT NULL;
However, the latter query is significantly faster (ca.30%)
I am trying to understand what happens under the hood. My guess is that in the second case the executor scans only the partitions which are of given type and date whereas in the first one, it attempts to scan all files satisfying the date requirement regardless of their type. That requires more time even though, as the records with NULL subtype are skipped, the actual query cost is the same.
If someone could verify my guess and, if it's wrong, correct that would be most appreciated.
Upvotes: 1
Views: 1975
Reputation: 20770
You mentioned the data is partitioned on date
and type
.
In WHERE date = '01012001' AND type <> X AND subtype IS NOT NULL
kind of query, Presto (Athena) will need to access only the relevant files.
In WHERE date = '01012001' AND subtype IS NOT NULL
kind of query, Presto will access all files for given date
(for all type
values). Thanks to the information stored in Parquet file's metadata, a file can be skipped as not containing non-NULL
values of subtype
. However, fetching the metadata still takes time, as there is a latency involved when accessing the storage (S3).
Upvotes: 4