Reputation: 77
I have a stream of configurations (they rarely change, but each update arrives as a message) and another stream of raw data points.
As I understand it, Spark doesn't currently support joining two streaming Datasets or DataFrames. Is there a good way to work around this?
Is it possible to "snapshot" one of the streaming datasets into a static dataset (probably the configuration one, since it has fewer updates) and then join it with the other streaming dataset?
Open to suggestions!
Upvotes: 2
Views: 2361
Reputation: 77
So here is what I ended up doing.
Put the stream with fewer updates into a memory sink, then select from that table. At that point it is a static instance and can be joined with the other stream. No trigger needed. Of course, you need to keep the table updated correctly yourself.
This is not very robust, but it's the best I could come up with until there is official support.
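To make the idea concrete, here is a minimal plain-Python sketch of the pattern (it is not Spark code). It assumes configurations are keyed by some id and that the latest message per key wins; `ConfigSnapshot`, `apply_update`, and `join` are hypothetical names I made up for illustration.

```python
from typing import Dict, List, Tuple


class ConfigSnapshot:
    """Hypothetical in-memory table holding the latest config per key."""

    def __init__(self) -> None:
        self._latest: Dict[str, dict] = {}

    def apply_update(self, key: str, config: dict) -> None:
        # A newer message for the same key overwrites the older one.
        self._latest[key] = config

    def join(self, batch: List[Tuple[str, int]]) -> List[Tuple[str, int, dict]]:
        # Inner join: data points without a known config are dropped.
        return [
            (key, value, self._latest[key])
            for key, value in batch
            if key in self._latest
        ]


snapshot = ConfigSnapshot()
snapshot.apply_update("sensor-a", {"scale": 2})

# First micro-batch: only sensor-a has a config so far.
first = snapshot.join([("sensor-a", 10), ("sensor-b", 5)])

# A config update arrives, so the next micro-batch sees it.
snapshot.apply_update("sensor-b", {"scale": 3})
second = snapshot.join([("sensor-b", 5)])
```

The fragile part is the same as in the real setup: each batch only sees whatever the snapshot contained at the moment of the join, so a late config update silently drops or mismatches earlier data points.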
Upvotes: 0
Reputation: 16086
The "workaround" is to use the current master branch ;)
It's not released yet, but the current master branch already has stream-stream inner joins, and outer joins are in progress. See this Jira ticket for reference; its sub-tasks list the joins you can use.
There's no other easy workaround. Streaming joins require saving the state of both streams and then updating that state correctly. You can see the code in the pull requests; a stream-stream join is quite complex to implement.
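To show why state is needed, here is a rough plain-Python sketch of a symmetric hash join, the textbook way to inner-join two unbounded inputs. It is not Spark's implementation, and `SymmetricHashJoin`, `on_left`, and `on_right` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Any, DefaultDict, List, Tuple


class SymmetricHashJoin:
    """Buffers every row from both sides so late arrivals can still match."""

    def __init__(self) -> None:
        self._left: DefaultDict[str, List[Any]] = defaultdict(list)
        self._right: DefaultDict[str, List[Any]] = defaultdict(list)

    def on_left(self, key: str, value: Any) -> List[Tuple[str, Any, Any]]:
        # Remember the row, then emit matches against the buffered right side.
        self._left[key].append(value)
        return [(key, value, r) for r in self._right.get(key, [])]

    def on_right(self, key: str, value: Any) -> List[Tuple[str, Any, Any]]:
        # Mirror image of on_left.
        self._right[key].append(value)
        return [(key, l, value) for l in self._left.get(key, [])]


joiner = SymmetricHashJoin()
out1 = joiner.on_left("k", 1)      # nothing on the right side yet
out2 = joiner.on_right("k", "a")   # matches the buffered left row
out3 = joiner.on_left("k", 2)      # matches the buffered right row
```

Note that both buffers grow without bound in this sketch. A real engine also has to decide when old state can be dropped (for example with watermarks) and has to make the state fault tolerant, which is where most of the complexity comes from.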
Upvotes: 3