Reputation: 2591
I am trying to see if I can stream data to Redshift using Spark Structured Streaming (v2.2). I found the spark-redshift
library (https://github.com/databricks/spark-redshift), but it only works in batch mode. Are there any other suggestions on how to do this with streaming data? And how is the performance of COPY
to Redshift?
Appreciate it!
Upvotes: 3
Views: 1566
Reputation: 4354
For low volumes of data (a few rows occasionally), it is OK to use:
insert into table ...
update table ...
delete from table ...
commands to maintain Redshift data. This is how Spark Streaming would most likely work.
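As a minimal sketch of that row-by-row path (assuming a hypothetical table my_table(id, value) and placeholder JDBC connection details, none of which are from the original answer), you could issue the inserts from a ForeachWriter, which is available in Spark 2.2:

    // Sketch only: one JDBC INSERT per streamed row, suitable for a trickle of data.
    import java.sql.{Connection, DriverManager, PreparedStatement}
    import org.apache.spark.sql.{ForeachWriter, Row}

    class RedshiftInsertWriter(jdbcUrl: String, user: String, password: String)
        extends ForeachWriter[Row] {

      private var conn: Connection = _
      private var stmt: PreparedStatement = _

      override def open(partitionId: Long, version: Long): Boolean = {
        // Requires the Redshift/PostgreSQL JDBC driver on the classpath.
        conn = DriverManager.getConnection(jdbcUrl, user, password)
        stmt = conn.prepareStatement("insert into my_table (id, value) values (?, ?)")
        true
      }

      override def process(row: Row): Unit = {
        stmt.setLong(1, row.getAs[Long]("id"))
        stmt.setString(2, row.getAs[String]("value"))
        stmt.executeUpdate() // one INSERT per row -- only OK for very low volumes
      }

      override def close(errorOrNull: Throwable): Unit = {
        if (stmt != null) stmt.close()
        if (conn != null) conn.close()
      }
    }

    // streamingDf.writeStream.foreach(new RedshiftInsertWriter(jdbcUrl, user, pwd)).start()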
However, for larger volumes you must always:

1) Write the data to S3, preferably chunked into 1 MB to 1 GB files, preferably gzipped.
2) Run the Redshift COPY command to load that S3 data into a Redshift "staging" area.
3) Run Redshift SQL to merge the staging data into your target tables.
Using this COPY method can be hundreds of times more efficient than individual inserts.
This means, of course, that you really have to run in batch mode.
You can run the batch update every few minutes to keep Redshift data latency low. A sketch of the flow follows below.
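Here is a rough sketch of that three-step flow, under the assumption that the stream is landed on S3 as gzipped CSV by the Structured Streaming file sink and that a separate job issues the COPY and merge SQL over JDBC. The bucket, IAM role, table and column names, and the 5-minute trigger interval are all placeholders, not part of the original answer:

    import java.sql.DriverManager
    import org.apache.spark.sql.streaming.Trigger

    // 1) Land micro-batches on S3 as gzipped CSV files every few minutes.
    val query = streamingDf.writeStream
      .format("csv")
      .option("compression", "gzip")
      .option("path", "s3a://my-bucket/redshift-staging/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/redshift-staging/")
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .start()

    // 2) and 3) From a separate periodic job: COPY the files into a staging table,
    // then merge staging into the target table.
    // Note: COPY from a bare prefix re-reads already-loaded files, so in practice
    // you would track loaded files between runs (for example with a manifest).
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    val stmt = conn.createStatement()
    try {
      stmt.execute(
        """copy staging_table
          |from 's3://my-bucket/redshift-staging/'
          |iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
          |format as csv gzip""".stripMargin)

      stmt.execute("begin")
      stmt.execute("delete from target_table using staging_table where target_table.id = staging_table.id")
      stmt.execute("insert into target_table select * from staging_table")
      stmt.execute("commit")
      // TRUNCATE commits implicitly in Redshift, so clear the staging table afterwards.
      stmt.execute("truncate staging_table")
    } finally {
      stmt.close()
      conn.close()
    }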
Upvotes: 3