Reputation: 41
We receive a few millions of records every 15 minutes. What is the best way to join the current set of records with the previous set of records for the same ids in spark structured streaming? How to reinitialize the previous state after a restart? We have tried HBase to store the previous state, but it turned to be very slow. If we use spark arbitrary sessions, how to reinitialize the previous state after restart? We have implemented this in Kafka streams now. But want to know if there is a way implement in spark structured streaming.
Upvotes: 2
Views: 354
Reputation: 74759
What is the best way to join the current set of records with the previous set of records for the same ids in spark structured streaming?
Arbitrary stateful flatMapGroupsWithState operator seems the best option.
How to reinitialize the previous state after a restart?
That happens automatically as part of Spark Structured Streaming. That's the purpose of checkpointLocation
option (with the state
directory underneath). You should not be concerned with these low-level infrastructure bits.
Upvotes: 1