Reputation: 71
I am baffled: I cannot figure out how to export a sucessfully run CREATE TABLE
statement to a single CSV.
The query "saves" the result of my Create Table command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?
Upvotes: 6
Views: 20649
Reputation: 497
Athena stores query results in Amazon S3. A results file stored automatically in a CSV format (*.csv) .So results can be exported into a csv file without CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html)
Execute athena query using StartQueryExecution API and results .csv can be found at the output location specified in api call. (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html)
Upvotes: 1
Reputation: 2668
If you want a result of CTAS query statement being written into a single file, then you would need to use bucketing by one of the columns you have in your resulting table. In order to get resulting files in csv format, you would need to specify tables' format and field delimiter properties.
CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://my_athena_results/ctas_query_result_bucketed/',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1)
AS (
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa
);
Athena is a distributed system, and it will scale the execution on your query by some unobservable mechanism. Note, that even explicitly specifying a bucket size of one, might still get multiple files [1].
See Athena documentation for more information on its syntax and what can be specified within WITH
directive. Also, don't forget about
considerations and limitations for CTAS Queries, e.g. the external_location
for storing CTAS query results in Amazon S3 must be empty etc.
CTAS
statements are compressed with GZIP
algorithm by default. I couldn't find in documentation how to change this behavior. So, all you would need is to uncompress it after you had downloaded it locally. NOTE: uncompressed files won't have .csv
file extension, but you still will be able to open them with text editors.You wont' be able to preserve column names inside files if you save them in csv
format. Instead, they would be specified in AWS Glue meta-data catalog, together with other information about a newly created table.
If you want to preserve column names in the output files after executing CTAS
queries, then you should consider file formats which inherently do that, e.g. JSON, Parquet etc. You can do that by using format
property within WITH
clause. Choice of file format really depends on a use case and size of data. Go with JSON if your files are relatively small and you want to download and be able to read their content virtually from anywhere. If files are big and you are planning to keep them on S3 and query with Athena, then go with Parquet.
Upvotes: 11