Reputation: 1767
There are a couple of options I can think of.
I'm not sure which is better. I'm also not clear on how to easily translate the Redshift schema into something Parquet can take, but maybe the Spark connector will take care of that for me.
Upvotes: 2
Views: 3109
Reputation: 311
Spark is not needed anymore. We can unload Redshift data to S3 in Parquet format directly. The sample code:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
FORMAT PARQUET
You can find more details in the Amazon Redshift documentation (UNLOAD - Amazon Redshift).
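If you still want the data in Spark afterwards, the unloaded files can be read back directly as Parquet. A minimal sketch, assuming an existing SparkSession named spark and an S3 path matching the UNLOAD target above (both are assumptions, not part of the answer):
// Read the Parquet files produced by the UNLOAD above (path is a placeholder)
val unloaded = spark.read.parquet("s3a://object-path/name-prefix/")
unloaded.printSchema()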
Upvotes: 4
Reputation: 5782
Get the Redshift JDBC jar and use sparkSession.read.jdbc
with the Redshift connection details, as in this example:
val properties = new java.util.Properties()
properties.put("driver", "com.amazon.redshift.jdbc42.Driver")
properties.put("url", "jdbc:redshift://redshift-host:5439/")
properties.put("user", "<username>") properties.put("password",spark.conf.get("spark.jdbc.password", "<default_pass>"))
val d_rs = spark.read.jdbc(properties.get("url").toString, "data_table", properties)
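To get from there to Parquet (the original goal), the DataFrame can simply be written back out. The S3 destination below is a placeholder, not something from this answer:
// Write the Redshift table out as Parquet files on S3 (destination path is hypothetical)
d_rs.write.mode("overwrite").parquet("s3a://your-bucket/parquet-output/")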
My relevant blog post: http://garrens.com/blog/2017/04/09/connecting-apache-spark-to-external-data-sources/
Spark Streaming should be irrelevant in this case.
I would also recommend using the Databricks spark-redshift package to make the bulk unload from Redshift and load into Spark much faster.
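A minimal sketch of reading through spark-redshift; the JDBC URL, temp directory, and credential handling here are placeholders/assumptions, not details from this answer:
// Read a Redshift table via the spark-redshift data source.
// Requires the spark-redshift package on the classpath and an S3 tempdir the cluster can write to.
val rsDf = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev?user=<username>&password=<password>") // placeholder connection string
  .option("dbtable", "data_table")
  .option("tempdir", "s3a://your-bucket/redshift-temp/") // hypothetical staging location for the bulk unload
  .option("forward_spark_s3_credentials", "true")        // or configure aws_iam_role instead
  .load()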
Upvotes: 1