Teja

Reputation: 13524

Unloading data files from Amazon Redshift to Amazon S3 in Parquet format

I would like to unload data files from Amazon Redshift to Amazon S3 in Apache Parquet format in order to query the files on S3 using Redshift Spectrum. I have searched everywhere but couldn't find anything about how to unload files from Amazon Redshift to S3 in Parquet format. Is this feature not supported yet, or was I just unable to find any documentation about it? Could somebody who has worked on this shed some light? Thank you.

Upvotes: 12

Views: 16555

Answers (5)

secdatabase

Reputation: 181

Redshift UNLOAD to the Parquet file format has been supported since December 2019:

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET
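
For the statement to actually run, you also need to supply a query and authorization. A minimal sketch, with a hypothetical table, bucket, and IAM role ARN:

-- Hypothetical names throughout; substitute your own table, bucket, and role.
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;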

It is mentioned in the Redshift Features announcements, the Unload Document has been updated accordingly, and an example is provided in the Unload Examples Document.

An excerpt from the official documentation:

The following example unloads the LINEITEM table in Parquet format, partitioned by the l_shipdate column.

unload ('select * from lineitem')
to 's3://mybucket/lineitem/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
PARQUET
PARTITION BY (l_shipdate);

Assuming four slices, the resulting Parquet files are dynamically partitioned into various folders.

s3://mybucket/lineitem/l_shipdate=1992-01-02/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-03/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
s3://mybucket/lineitem/l_shipdate=1992-01-04/0000_part_00.parquet
                                             0001_part_00.parquet
                                             0002_part_00.parquet
                                             0003_part_00.parquet
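
Since the goal in the question is to query these files with Redshift Spectrum, the partitioned output above could then be registered as an external table. A hedged sketch, assuming an external schema named spectrum already exists and showing only a few of the LINEITEM columns:

-- Assumes an external schema "spectrum" backed by the AWS Glue Data Catalog;
-- the partition column is declared separately, not in the column list.
CREATE EXTERNAL TABLE spectrum.lineitem (
    l_orderkey      BIGINT,
    l_quantity      DECIMAL(12,2),
    l_extendedprice DECIMAL(12,2)
)
PARTITIONED BY (l_shipdate DATE)
STORED AS PARQUET
LOCATION 's3://mybucket/lineitem/';

-- Each partition folder written by UNLOAD must be added before it is queryable.
ALTER TABLE spectrum.lineitem
ADD PARTITION (l_shipdate = '1992-01-02')
LOCATION 's3://mybucket/lineitem/l_shipdate=1992-01-02/';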

Upvotes: 18

Kirk Broadhurst

Reputation: 28698

You can't do this. Redshift doesn't know about Parquet (although you can read Parquet files through the Spectrum abstraction).

You can UNLOAD to text files. They can be encrypted or compressed, but they are only ever flat text files.
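
For illustration, a plain text UNLOAD of the kind described above might look like the following sketch (table, bucket, and role are hypothetical; GZIP is one of the supported compression options):

-- Writes pipe-delimited, gzip-compressed text files; all names are hypothetical.
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;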


Update: it looks like COPY from Parquet is now supported:

https://aws.amazon.com/about-aws/whats-new/2018/06/amazon-redshift-can-now-copy-from-parquet-and-orc-file-formats/

Upvotes: 3

Mukund

Reputation: 946

A great solution, Spectrify, does this; but if you don't want to do it that way, you could use Spark on EMR with the Databricks spark-redshift library to read data from Redshift and write it to S3 in Parquet format.

The following link will give you an idea of how to do the same:

https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

Upvotes: 0

Colin Nichols

Reputation: 714

A bit late, but Spectrify does exactly this.

Upvotes: 7

Tanmoy Bhattacharjee

Reputation: 1080

Unfortunately, so far, Amazon Redshift has not extended its ability to read the Parquet format directly.

You can, however, do one of the following:

  1. Use Amazon Redshift Spectrum to read them (a sketch of the setup follows this list).
  2. Use an AWS Glue crawler to convert them for you.
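
For option 1, the external schema that Spectrum queries run against could be created roughly as follows; the catalog database name and role ARN are hypothetical:

-- Hypothetical database and role; registers an external schema backed by
-- the AWS Glue Data Catalog, creating the catalog database if needed.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;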

As of today, there is no out-of-the-box support for Apache Parquet in Redshift itself.

I hope this helps.

Upvotes: 1
