user3490715
user3490715

Reputation: 71

Export athena table to S3 as one readable file

I am baffled: I cannot figure out how to export a sucessfully run CREATE TABLE statement to a single CSV.

The query "saves" the result of my Create Table command in an appropriately named S3 bucket, partitioned into 60 (!) files. Alas, these files are not readable text files

CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid AS 
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_aaid
UNION ALL
SELECT *
FROM targetsmart_idl_data_pa_mi_deduped_idfa

How can I save this table to S3, as a single file, CSV format, without having to download and re-upload it?

Upvotes: 6

Views: 20649

Answers (2)

P. Phalak
P. Phalak

Reputation: 497

Athena stores query results in Amazon S3. A results file stored automatically in a CSV format (*.csv) .So results can be exported into a csv file without CREATE TABLE statement (https://docs.aws.amazon.com/athena/latest/ug/querying.html)

Execute athena query using StartQueryExecution API and results .csv can be found at the output location specified in api call. (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html)

Upvotes: 1

Ilya Kisil
Ilya Kisil

Reputation: 2668

If you want a result of CTAS query statement being written into a single file, then you would need to use bucketing by one of the columns you have in your resulting table. In order to get resulting files in csv format, you would need to specify tables' format and field delimiter properties.

CREATE TABLE targetsmart_idl_data_pa_mi_deduped_maid 
WITH (
      format = 'TEXTFILE',
      field_delimiter = ',', 
      external_location = 's3://my_athena_results/ctas_query_result_bucketed/', 
      bucketed_by = ARRAY['__SOME_COLUMN__'], 
      bucket_count = 1) 
AS (
    SELECT *
    FROM targetsmart_idl_data_pa_mi_deduped_aaid
    UNION ALL
    SELECT *
    FROM targetsmart_idl_data_pa_mi_deduped_idfa
);

Athena is a distributed system, and it will scale the execution on your query by some unobservable mechanism. Note, that even explicitly specifying a bucket size of one, might still get multiple files [1].

See Athena documentation for more information on its syntax and what can be specified within WITH directive. Also, don't forget about considerations and limitations for CTAS Queries, e.g. the external_location for storing CTAS query results in Amazon S3 must be empty etc.

Update 2019-08-13

  • Apparently, the result of CTAS statements are compressed with GZIP algorithm by default. I couldn't find in documentation how to change this behavior. So, all you would need is to uncompress it after you had downloaded it locally. NOTE: uncompressed files won't have .csv file extension, but you still will be able to open them with text editors.

Update 2019-08-14

  • You wont' be able to preserve column names inside files if you save them in csv format. Instead, they would be specified in AWS Glue meta-data catalog, together with other information about a newly created table.

  • If you want to preserve column names in the output files after executing CTAS queries, then you should consider file formats which inherently do that, e.g. JSON, Parquet etc. You can do that by using format property within WITH clause. Choice of file format really depends on a use case and size of data. Go with JSON if your files are relatively small and you want to download and be able to read their content virtually from anywhere. If files are big and you are planning to keep them on S3 and query with Athena, then go with Parquet.

Upvotes: 11

Related Questions