Reputation: 23
The main question:
I can't seem to find definitive info about how $path works when used in a where clause in Athena.
select * from <database>.<table> where $path = 'know/path/'
Given a table definition at the top level of a bucket, if there are no partitions specified but the bucket is organized using prefixes does it scan the whole table? Or does it limit the scan to the specified path in a similar way to partitions? Any reference to an official statement on this?
The specific case:
I have information being stored in s3, this information needs to be counted and queried once or twice a day, the prefixes are two different IDs (s3:bucket/IDvalue1/IDvalue2/) and then the file with the relevant data. On a given day any number of new folders might be created (on busy days it could be day tens of thousands) or new files added to existing prefixes. So, maintaining the partition catalog up to date seems a little complicated.
One proposed approach to avoid partitions is using $path when getting data from a know combination of IDs, but I cannot seem to find whether using such approach would actually limit the amount of data scanned per query. I read a comment saying it does not but I cannot find it in the documentation and was wondering if anyone knows how it works and can point to the proper reference.
So far googling and reading the docs has not clarified this.
Upvotes: 2
Views: 1255
Reputation: 132922
Update in 2023: Since I wrote the answer below Athena has added support for only reading files that match predicates on $path
.
Athena does not have any optimisation for limiting the files scanned when using $path
in a query. You can verify this for yourself by running SELECT * FROM some_table
and SELECT * FROM some_table WHERE $path = '…'
and comparing the bytes scanned (they will be the same, if there was an optimisation they would be different – assuming there is more than one file of course).
See Query by "$path" field and Athena: $path vs. partition
For your use case I suggest using partition projection with the injected type. This way you can limit the prefixes on S3 that Athena will scan, while at the same time not have to explicitly add partitions.
You could use something like the following table properties to set it up (use the actual column names in place of id_col_1
and id_col_2
, obviously):
CREATE EXTERNAL TABLE some_table
…
TBLPROPERTIES (
"projection.id_col_1.type" = "injected",
"projection.id_col_2.type" = "injected",
"storage.location.template" = "s3://bucket/${id_col_1}/${id_col_2}/"
)
Note that when querying a table that uses partition projection with the injected type all queries must contain explicit values for the the projected columns.
Upvotes: 8