Workaround for joining two streams in structured streaming in Spark 2.x

Question

I have a stream of configurations (they rarely change, but every update arrives as a message) and another stream of raw data points.

As I understand it, Spark doesn't currently support joining two streaming Datasets or DataFrames. Is there a good way to work around this?

Is it possible to "snapshot" one of the streaming Datasets into a static Dataset (probably the configuration one, since it has fewer updates) and then join it with the other streaming Dataset?

Open to suggestions!

pikapoo · Accepted Answer

So here is what I ended up doing.

Put the stream with fewer updates into a memory sink, then do a select from that table. At that point the result is a static Dataset and can be joined with the other stream. No trigger is needed. Of course, you have to keep the table up to date correctly yourself.

This is not very robust, but it's the best I could come up with until there is official support.
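
The steps above could look roughly like the following PySpark sketch. It is only an illustration under assumptions: the helper name `join_via_memory_sink`, the table name `configs`, the join column `key`, and the `rate` test sources in `main()` are all made up for the example and are not from my actual job.

```python
def join_via_memory_sink(spark, config_stream, raw_stream, key="key", table="configs"):
    """Join raw_stream against a memory-sink copy of config_stream.

    Hypothetical helper: `key` and `table` are illustrative names.
    """
    # 1. Write the rarely updated stream into Spark's built-in memory sink.
    #    The query name becomes the name of the in-memory table.
    config_query = (
        config_stream.writeStream
        .format("memory")
        .queryName(table)
        .outputMode("append")
        .start()
    )

    # 2. Selecting from that table gives a static (non-streaming) DataFrame.
    config_snapshot = spark.table(table)

    # 3. A stream-static join is allowed, so join the other stream against it.
    #    In append mode the table keeps every config message, so picking the
    #    latest row per key is left to the caller.
    joined = raw_stream.join(config_snapshot, on=key, how="inner")
    return config_query, joined


def main():
    # Imported here so the helper above can be read without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("memory-sink-join")
        .getOrCreate()
    )

    # Two synthetic streams from the built-in "rate" source (assumed stand-ins
    # for the real configuration and raw-data streams).
    configs = (
        spark.readStream.format("rate").option("rowsPerSecond", 1).load()
        .selectExpr("value % 3 AS key", "value AS setting")
    )
    raw = (
        spark.readStream.format("rate").option("rowsPerSecond", 1).load()
        .selectExpr("value % 3 AS key", "value AS reading")
    )

    _, joined = join_via_memory_sink(spark, configs, raw)
    joined.writeStream.format("console").start().awaitTermination()
```

Calling `main()` on a local PySpark installation should print joined micro-batches to the console. Whether config rows that arrive later show up in a join that is already running is worth checking on your own Spark version. On Spark 2.3 and later, stream-stream joins are supported natively, so this workaround is mainly relevant to earlier 2.x releases.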
