Reputation: 201
I'm wondering how Athena partition projections handles folders containing CSV files with different fields. Let's say my bucket is partitioned by year, month and day as s3://SalesPurchases/{year}/{month}/{day}. Over each 15 min interval I generate two files describing sales and purchases made over that interval. The file names will be for example, Sales-15min-03:00.csv indicating that it was generated at 3am and holds the previous 15 minutes worth of updates. These files contain the following fields:
Sales: id, description, amount - (all strings)
Purchases: id, description, vendor, amount - (all strings)
If I create a table for Sales:
CREATE EXTERNAL TABLE sales_data (
id string,
description string,
amount string
)
PARTITIONED BY (
`year` string,
`month` string,
`day` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://SalesPurchases/'
TBLPROPERTIES (
"skip.header.line.count" = "1",
"projection.enabled" = "true",
"projection.year.type" = "integer",
"projection.year.range" = "2020,2021",
"projection.month.type" = "integer",
"projection.month.range" = "01,12",
"projection.day.type" = "integer",
"projection.day.range" = "01,31",
"storage.location.template" = "s3://SalesPurchases/${year}/${month}/${day}"
)
I'm just wondering about the efficiency of this approach. There is no indication in this CREATE statement to only filter on "Sales" files. When it comes time to actually querying the data for year=2020, month=01, day=01, I believe that all the Sales and Purchases files in that partition will be read. I'm beginning to believe that this is no different to how creating the Data Catalog entry via a glue crawler would be. They would both define the same schema information. However I do believe that partition projections are more optimal somehow.
Since all fields in Sales are a subset of the fields in Purchases, I'm also interested in understanding how this is handled.
One other oddity I noticed is that when I use the above create table statement, when it comes time to querying and I do something like
select * from sales_data limit 10
select count(*) from sales_data
I get zero results. I have to specifically include further filters in the query to get meaningful information e.g.
select * from sales_data where year = '2020' and month = '01' and day = '01' limit 10
Is this an oddity with partition projections? This however is not the case when using a crawler to generate the schema info. select * from sales_data limit 10
will return non-zero results in this case. Why is this?
Thanks.
Upvotes: 0
Views: 709
Reputation: 132882
Athena requires all files in a table to have the same schema. When it runs a query it will list and process all files in the S3 prefix given by the tables LOCATION
(or the table's partitions' LOCATION
). There is no way to tell it to filter files by name, it will always process all files. This works the same whether or not you use Partition Projection.
Glue Crawler is not the solution for this either, it will completely mess up if you use it with this setup.
Put your sales and purchases files in separate prefixes and create separate tables.
The reason your queries end up with zero results is that you need to configure the month and day partition keys to zero-fill:
"projection.month.digits" = "2"
"projection.day.digits" = "2"
It's confusing that the range allows zero-prefixed values even though it doesn't actually configure the range to be zero-prefixed.
Upvotes: 2