Reputation: 67450
I've been reading this AWS blog article and it made sense to me up until the part where it talks about partitions. The query it uses to create the table looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs_raw_native_part (
request_timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string )
PARTITIONED BY(year string, month string, day string) -- Where does Athena get this data?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1','input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (-|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://athena-examples/elb/raw/';
What's confusing to me is that it says "partition by year" (among other things), but nowhere else in that "SQL" does it specify which part of the data is the year. Also, none of these columns has a date type. So how does Athena know how to partition this data when you haven't told it which part of the data is the year, month, or day?
In the context of the blog article, it says the year comes from the file name, but there was no step to tell Athena that information. The article says this is the pre-defined format: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format but that doesn't have a year column that I can see.
Edit: The article was not very explicit about this, but I think it might be saying that every PARTITIONED BY column is a sub-dir inside the S3 bucket? In other words, the first element in the PARTITIONED BY clause (year in this case) is the first sub-dir of the bucket, and so on.
That only makes partial sense to me, because the same article says, "You can partition your data across multiple dimensions―e.g., month, week, day, hour, or customer ID―or all of them together." I don't understand how you could do all of that if they come from sub-dirs unless you had a ton of duplication in your bucket.
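The only way I can picture "all of them together" without duplication is nesting: one sub-dir level per PARTITIONED BY column. A hypothetical layout (bucket, names, and values made up by me, not from the article):

s3://my-log-bucket/logs/acme/2015/01/01/events1.log
s3://my-log-bucket/logs/acme/2015/01/02/events2.log
s3://my-log-bucket/logs/bmart/2015/01/01/events3.log

Here a PARTITIONED BY(customer string, year string, month string, day string) would map customer to the first sub-dir level, year to the second, and so on.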
Upvotes: 4
Views: 3650
Reputation: 6649
Very interesting finding, Daniel! It reminds me of an old discussion I had with AWS support about this topic. I want to post an excerpt of it here; maybe someone finds it useful:
I just read the Athena documentation about partitioning data in S3 [1].
I wonder about the sample which is given in "Scenario 1: Data already partitioned and stored on S3 in hive format", "Storing Partitioned Data":
the command "aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/" returns e.g. "PRE dt=2009-04-12-13-00/".Is my assumption correct, that in order to be able to partition the data in Athena automatically, I have to prefix my S3 folder names with "partition_key=actual_folder_name"?
Otherwise I do not understand why the example ls-command above returns S3 keys which start with the "dt=" prefix. I think it should be better documented at this point in the Athena documentation what "data on S3 in hive format" means. [...]

References:
[1] https://docs.aws.amazon.com/athena/latest/ug/partitions.html
I understand that you have some questions about partitioning in Athena as per AWS documentation https://docs.aws.amazon.com/athena/latest/ug/partitions.html
To answer your question: Is my assumption correct, that in order to be able to partition the data in Athena automatically, I have to prefix my S3 folder names with "partition_key=actual_folder_name"?
Yes, you are correct that in order for Athena to detect the partitions automatically, the S3 prefixes should be in 'key=value' pairs. Otherwise you have to add all those partitions manually using the 'ALTER TABLE ... ADD PARTITION' command, as mentioned in the documentation itself.
I understand that you found our documentation to be less verbose with respect to Hive-style partitioning. [...] However, the reason for the brief description of Hive partitioning is that Hive, being an open source tool, has open source documentation explaining Hive-style partitioning in detail, e.g. link [1].
If you find changing the S3 naming or adding partitions manually to be tedious, I can suggest using an AWS Glue crawler[2] to create an Athena table over your S3 data. Glue will detect the partitions even with non-Hive style partitioning and will assign keys to the partitions like 'partition_0', 'partition_1', etc. [...]
References:
[1] http://hadooptutorial.info/partitioning-in-hive/
[2] https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
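For completeness, here is what the "automatic" route looks like once the prefixes are in key=value form. This is a minimal sketch of my own, not from the support thread; it reuses the table from the question and assumes a Hive-style layout that the blog's bucket does not actually have:

-- Assumed layout: s3://athena-examples/elb/raw/year=2015/month=01/day=01/...
-- MSCK REPAIR TABLE scans the table's LOCATION and registers every
-- key=value prefix it finds as a partition:
MSCK REPAIR TABLE elb_logs_raw_native_part;

-- Check which partitions Athena now knows about:
SHOW PARTITIONS elb_logs_raw_native_part;

A table created by a Glue crawler over non-Hive paths would instead expose the generic partition columns the support answer mentions, queried like WHERE partition_0 = '2015'.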
Upvotes: 1
Reputation: 67450
This article (the Athena partitions documentation linked in the other answer: https://docs.aws.amazon.com/athena/latest/ug/partitions.html) explains it better than the original blog post.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data. There are two scenarios discussed in the following sections:
Data is already partitioned, stored on Amazon S3, and you need to access the data on Athena.
Data is not partitioned.
My question was about the first. For that, the documentation says:
Partitions are stored in separate folders in Amazon S3. For example, here is the partial listing for sample ad impressions:
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/
Here, logs are stored with the column name (dt) set equal to date, hour, and minute increments. When you give a DDL with the location of the parent folder, the schema, and the name of the partitioned column, Athena can query data in those subfolders.
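In other words, the DDL never parses a date out of the data itself; it just declares dt as a partition column and points LOCATION at the parent folder. A sketch of what that DDL could look like (the column list here is abbreviated and assumed by me; the sample data is JSON):

CREATE EXTERNAL TABLE IF NOT EXISTS impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  userAgent string )
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

Once the dt=... partitions are registered, dt is queryable like any other column, e.g. WHERE dt = '2009-04-12-13-00'.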
The failing of the article (IMO) is that it never showed an aws s3 ls. If it had, I wouldn't have been confused. The article assumes there are S3 keys named year, month, and day; the PARTITIONED BY clause refers to those keys.
If your files aren't organized as neatly as this, you can use a different SQL statement to read them in and partition them (scenario 2 mentioned above):
ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015', month='01', day='01') LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
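However the partitions get registered, the partition columns then behave like ordinary columns in queries, and filtering on them limits which S3 prefixes Athena reads. A hypothetical query against the table above:

-- Only scans objects under .../2015/01/01/, not the whole bucket
SELECT request_ip, url, backend_response_code
FROM elb_logs_raw_native_part
WHERE year = '2015' AND month = '01' AND day = '01'
LIMIT 10;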
Upvotes: 4