Chicago1988
Chicago1988

Reputation: 684

query from athena a file in s3

I am complete newbie on this. I have this file in amazon s3.

How can I query this .tar.gz from Athena?

enter image description here

I am assuming I have to somehow decompress and ‘restore’ to ‘athena’? But I do not know how to do it.

Upvotes: 1

Views: 3471

Answers (2)

Harvesting Data
Harvesting Data

Reputation: 170

You can directly query files in AWS Athena that are in .gz format as well as any flat files. If your tar file contains multiple .gz files and they are of the same file format then you don't need to gunzip them to .tsv.

Since, you have already converted to .tsv files make sure the files of the same format are put into a folder e.g. s3://bucketname/folder/file1.gz s3://bucketname/folder/file2.gz

etc. file1 and file2 should have the same structure.

Then define your AWS Athena table on top of this. Sample script below -

CREATE EXTERNAL TABLE table_name (
yr INT,
 quarter INT,
 month INT,
 dayofmonth INT,
 dayofweek INT,
 flightdate STRING
)
 PARTITIONED BY (year STRING)
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
 LOCATION 's3://bucketname/folder/';

Keeping homogeneous files is not mandatory but recommended so that you can add remove files under the same folder and just update the partition information every time there is a change.

Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.

MSCK REPAIR TABLE table_name ;

Reference - https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html#tsv-example

Upvotes: 2

Exelian
Exelian

Reputation: 5888

You can't query tarballs. Athena requires gzipped or uncompressed text-files. Other options are ORC or parquet files. You will need to untar the file and create a gzip file with just the .txt in it.

Upvotes: 0

Related Questions