Reputation: 816
My situation is this:
I have many csv files uploaded to S3 and a set of external tables defined in Athena that access the corresponding folders in the S3 bucket. As the data volume grows, my boss would like to partition the data to reduce cost and improve query performance.
My questions are:
According to the docs, partitioned data will be in Hive format. Does that mean all my csv files will be changed into Hive format?
Will the total data volume in S3 increase because both the csv and the Hive-format files are stored?
If no file format change (csv to Hive) is needed, does partitioning just mean placing the data in different sub-folders or sub-databases based on a specific csv column? If that is the case, when I partition on two csv columns that are not correlated, will the data be duplicated because of the partitioning?
I am new to AWS and have no idea how partitioning can be done in Athena. A detailed explanation would be highly appreciated.
Upvotes: 1
Views: 3666
Reputation: 2298
To answer your questions in order:
You can partition the data however you like and keep the csv file format. Athena uses Hive for partitioning, but partitioning in and of itself does not change the data format. (You could use AWS Glue to set up a workflow that converts data between Parquet, CSV, and other supported formats.) You can also compress your csv files with gzip (or another supported compression algorithm) to reduce scan sizes and improve performance; a short sketch of that follows these answers.
No. As you are not storing multiple copies of the data (unless you choose to), storage will not increase. (You could archive the non-partitioned data to Amazon Glacier.)
A better way to think of partitioning in Athena is as "tags" rather than "folders." As you gain experience with S3 and start making API calls or using the CLI, you will find that S3 has no real concept of a folder (despite the console's "Create folder" button); object keys simply share a common prefix.
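As a quick sketch of the compression point from the first answer (the file name here is just the example file produced below): gzip a csv before uploading and Athena will read it as-is, with scan charges based on the compressed size.
$ gzip cities_data_1.csv    # produces cities_data_1.csv.gz, which Athena can query directly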
A detailed explanation can be found here. Consider the following csv file, cities.csv:
1,New York,8500000
2,Boston,673184
3,Philadelphia,1568000
4,Baltimore,621849
5,Washington DC,681170
Using AWK (or the PowerShell equivalent), we can break this file into five files.
$ awk -F "," '{OFS=","; print $2,$3 > ("cities_data_"$1".csv")}' cities.csv
$ ls
cities.csv cities_data_1.csv cities_data_2.csv cities_data_3.csv cities_data_4.csv cities_data_5.csv
We can now upload these files to S3 and tag them with the partition value (put them under a prefix named for the partition). For MSCK REPAIR TABLE (used below) to register the partitions automatically, each prefix must follow the Hive convention partition_column=value/, for example id_partitioned=1/.
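A minimal sketch of that upload with the AWS CLI, assuming the CLI is configured and <YOURBUCKETNAME> is the same placeholder bucket used in the DDL below:
# one prefix per partition value, in Hive key=value form
$ for i in 1 2 3 4 5; do aws s3 cp "cities_data_${i}.csv" "s3://<YOURBUCKETNAME>/id_partitioned=${i}/"; done
Each object then lives under a key such as id_partitioned=1/cities_data_1.csv, which is exactly the "tag" MSCK REPAIR TABLE looks for.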
We can now run DDL from the console (or CLI or API):
CREATE EXTERNAL TABLE cities (
name string,
population int
) PARTITIONED BY (
id_partitioned int
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://<YOURBUCKETNAME>/';
Then load the partitions:
MSCK REPAIR TABLE cities;
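MSCK REPAIR TABLE only discovers prefixes that follow the key=value convention. If your keys use a different layout, or you prefer to add partitions one at a time, you can register them explicitly; the statement below is a sketch using the same placeholder bucket:
-- register a single partition explicitly instead of relying on MSCK REPAIR
ALTER TABLE cities ADD PARTITION (id_partitioned = 1)
LOCATION 's3://<YOURBUCKETNAME>/id_partitioned=1/';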
And now we can query based on those partitions:
SELECT
name,
population
FROM cities
where id_partitioned = 1
You can experiment with the where clause and see the effect on scan sizes.
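For example, dropping the partition filter forces Athena to read every registered partition, and the data-scanned figure reported for the query grows accordingly:
SELECT
  name,
  population
FROM cities;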
Upvotes: 2