Chengzhi

Reputation: 2591

Has anyone tried streaming data to Redshift using Spark Structured Streaming?

I am trying to see if I can stream data to Redshift using Spark Structured Streaming (v2.2). I found the spark-redshift library (https://github.com/databricks/spark-redshift), but it only works in batch mode. Any other suggestions on how to do this with streaming data? How is the performance of COPY to Redshift?

Appreciate it!

Upvotes: 3

Views: 1566

Answers (1)

Jon Scott

Reputation: 4354

For low volumes of data (a few rows arriving occasionally) it is OK to use:

insert into table ...
update table ...
delete from table ...

commands to maintain Redshift data. This is how a Spark streaming sink would likely work.
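For illustration, here is a minimal PySpark sketch of that pattern: a foreachBatch sink that appends each micro-batch with plain JDBC inserts. Note that foreachBatch requires Spark 2.4+ (on 2.2 you would need a Scala ForeachWriter instead), and the source, endpoint, table name, and credentials below are all placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-insert-sink").getOrCreate()

# Assumed source for illustration only; substitute your real stream.
events = spark.readStream.format("rate").load()

def write_batch(batch_df, batch_id):
    # Appends each micro-batch with individual JDBC INSERTs --
    # acceptable for a trickle of rows, far too slow at volume.
    (batch_df.write
        .mode("append")
        .jdbc(url="jdbc:redshift://mycluster:5439/db",  # placeholder endpoint
              table="public.events",                    # placeholder table
              properties={"user": "user",
                          "password": "secret",
                          "driver": "com.amazon.redshift.jdbc42.Driver"}))

query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()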

However, for larger volumes you must always:

1) write the data to S3, preferably chunked into 1MB to 1GB files, preferably gzipped

2) run the Redshift COPY command to load that S3 data into a Redshift "staging" area

3) run Redshift SQL to merge the staging data into your target tables

Using this COPY method can be hundreds of times more efficient than individual inserts.
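As a rough sketch of steps 2 and 3, assuming a staging_events table keyed on id and an IAM role with read access to the bucket (every name below is a placeholder), the COPY and merge could be driven from Python with psycopg2, since Redshift speaks the PostgreSQL wire protocol:

import psycopg2

# Placeholder connection details; any PostgreSQL driver will work.
conn = psycopg2.connect(host="mycluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="db", user="user", password="secret")
with conn, conn.cursor() as cur:
    # 2) Bulk-load the gzipped S3 chunks into the staging table.
    cur.execute("""
        COPY staging_events
        FROM 's3://my-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        GZIP CSV;
    """)
    # 3) Merge staging into the target: delete rows being replaced,
    #    insert the fresh ones, then clear the staging table.
    cur.execute("""
        DELETE FROM events USING staging_events
        WHERE events.id = staging_events.id;
    """)
    cur.execute("INSERT INTO events SELECT * FROM staging_events;")
    cur.execute("TRUNCATE staging_events;")
conn.close()

The delete-then-insert merge is the standard Redshift upsert idiom, since Redshift has no native MERGE-style single statement for this.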

This means, of course, that you really have to run in batch mode.

You can run the batch update every few minutes to keep Redshift data latency low.
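A sketch of step 1 at that cadence, assuming the same placeholder bucket: write gzipped files to S3 on a fixed trigger and leave the COPY/merge to a scheduled job. The five-minute trigger here is an arbitrary choice; whatever interval you pick is the floor on how fresh the Redshift data can be.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-s3-stage").getOrCreate()

# Assumed source for illustration only; substitute your real stream.
events = spark.readStream.format("rate").load()

(events.writeStream
    .format("csv")
    .option("compression", "gzip")  # gzipped chunks, as recommended above
    .option("path", "s3a://my-bucket/events/")             # placeholder bucket
    .option("checkpointLocation", "s3a://my-bucket/chk/")  # placeholder path
    .trigger(processingTime="5 minutes")  # batch cadence = Redshift latency floor
    .start()
    .awaitTermination())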

Upvotes: 3
