Reputation: 1767
There are a couple of options I can think of.
I'm not sure which is better. I'm also not clear on how to easily translate the Redshift schema into something Parquet can take, but maybe the Spark connector will take care of that for me.
Upvotes: 2
Views: 3109
Reputation: 311
Spark is not needed anymore. We can unload Redshift data to S3 in Parquet format directly. The sample code:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT PARQUET
You can find more details in the Amazon Redshift documentation (UNLOAD - Amazon Redshift).
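If you still want the data in Spark afterwards, the unloaded files can be read back directly as Parquet. A minimal sketch, assuming an existing SparkSession named spark and an S3 path matching the UNLOAD target above (both are assumptions, not part of the answer):
// Read the Parquet files produced by the UNLOAD above (path is a placeholder)
val unloaded = spark.read.parquet("s3a://object-path/name-prefix/")
unloaded.printSchema()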
Upvotes: 4
Reputation: 5782
Get the Redshift JDBC jar and use sparkSession.read.jdbc
with the Redshift connection details, as in this example:
val properties = new java.util.Properties()
properties.put("driver", "com.amazon.redshift.jdbc42.Driver")
properties.put("url", "jdbc:redshift://redshift-host:5439/")
properties.put("user", "<username>") properties.put("password",spark.conf.get("spark.jdbc.password", "<default_pass>"))
val d_rs = spark.read.jdbc(properties.get("url").toString, "data_table", properties)
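To get from there to Parquet (the original goal), the DataFrame can simply be written back out. The S3 destination below is a placeholder, not something from this answer:
// Write the Redshift table out as Parquet files on S3 (destination path is hypothetical)
d_rs.write.mode("overwrite").parquet("s3a://your-bucket/parquet-output/")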
My relevant blog post: http://garrens.com/blog/2017/04/09/connecting-apache-spark-to-external-data-sources/
Spark Streaming should be irrelevant in this case.
I would also recommend using the Databricks spark-redshift package to make the bulk unload from Redshift and load into Spark much faster.
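A minimal sketch of reading through spark-redshift; the JDBC URL, temp directory, and credential handling here are placeholders/assumptions, not details from this answer:
// Read a Redshift table via the spark-redshift data source.
// Requires the spark-redshift package on the classpath and an S3 tempdir the cluster can write to.
val rsDf = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev?user=<username>&password=<password>") // placeholder connection string
  .option("dbtable", "data_table")
  .option("tempdir", "s3a://your-bucket/redshift-temp/") // hypothetical staging location for the bulk unload
  .option("forward_spark_s3_credentials", "true")        // or configure aws_iam_role instead
  .load()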
Upvotes: 1