Manjunath
Manjunath

Reputation: 41

How to join a stream with previous window data?

We receive a few millions of records every 15 minutes. What is the best way to join the current set of records with the previous set of records for the same ids in spark structured streaming? How to reinitialize the previous state after a restart? We have tried HBase to store the previous state, but it turned to be very slow. If we use spark arbitrary sessions, how to reinitialize the previous state after restart? We have implemented this in Kafka streams now. But want to know if there is a way implement in spark structured streaming.

Upvotes: 2

Views: 354

Answers (1)

Jacek Laskowski
Jacek Laskowski

Reputation: 74759

What is the best way to join the current set of records with the previous set of records for the same ids in spark structured streaming?

Arbitrary stateful flatMapGroupsWithState operator seems the best option.

How to reinitialize the previous state after a restart?

That happens automatically as part of Spark Structured Streaming. That's the purpose of checkpointLocation option (with the state directory underneath). You should not be concerned with these low-level infrastructure bits.

Upvotes: 1

Related Questions