Reputation: 2994
Is there any way in Spark Streaming to keep data across multiple micro-batches of a sorted DStream, where the stream is sorted using timestamps (assuming monotonically arriving data)? Can anyone suggest how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?
What does iteration mean?
I first sort the DStream using timestamps, assuming that data arrives with monotonically increasing timestamps (no out-of-order data).
I need a global HashMap X, which I would like to update using values with timestamp "t1", and then subsequently with "t1+1". Since the state of X itself impacts the calculations, the processing has to be sequential: the operation at "t1+1" depends on HashMap X, which in turn depends on the data at and before "t1".
Application
This is especially the case when one is trying to update a model, compare two sets of RDDs, or keep a global history of certain events that will impact operations in future iterations.
I would like to keep some accumulated history for the calculations: not the entire dataset, but certain events that can be persisted and used in future DStream RDDs. A sketch of what I mean follows below.
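For concreteness, a minimal sketch of the dependency I am describing, assuming a JavaPairDStream<String, Long> named sortedStream that is already sorted by timestamp (the names, and the naive driver-side HashMap, are purely illustrative of the intent):

```java
import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

// Global state X on the driver: batch "t1+1" must see the state left by batch "t1".
Map<String, Long> X = new HashMap<>();

sortedStream.foreachRDD(rdd -> {
    // Pulling each micro-batch to the driver, only to illustrate the sequential dependency.
    for (Tuple2<String, Long> event : rdd.collect()) {
        long previous = X.containsKey(event._1()) ? X.get(event._1()) : 0L;
        X.put(event._1(), previous + event._2()); // result depends on all prior batches
    }
    return null; // foreachRDD takes a Function<R, Void> in Spark 1.x
});
```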
Upvotes: 1
Views: 704
Reputation: 7180
updateStateByKey does precisely that: it lets you define some state, as well as a function to update it based on each RDD in your stream. This is the typical way to accumulate historical computations over time.
From the doc:
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.
- Define the state - The state can be of arbitrary data type.
- Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
More info here: https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#updatestatebykey-operation
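As a minimal Java sketch of the idea, a running sum per key (the stream name and types are assumptions; this targets the Spark 1.x Java API, which wraps the state in Guava's Optional):

```java
import java.util.List;

import com.google.common.base.Optional;

import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Combine the new values for a key in this micro-batch with its accumulated state
// (here: a running sum per key).
Function2<List<Long>, Optional<Long>, Optional<Long>> updateFunction =
    (newValues, state) -> {
        long sum = state.or(0L);     // previous state, or 0 the first time a key is seen
        for (Long v : newValues) {
            sum += v;
        }
        return Optional.of(sum);     // becomes the state seen by the next batch
    };

// events: a JavaPairDStream<String, Long> of (key, value) pairs per micro-batch.
JavaPairDStream<String, Long> runningTotals = events.updateStateByKey(updateFunction);
```

Note that updateStateByKey requires checkpointing to be enabled (ssc.checkpoint(...)) so the accumulated state can be recovered after a failure.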
If this does not cut it or you need more flexibility, you can always store the state explicitly in a key-value store such as Cassandra (cf. the Cassandra connector: https://github.com/datastax/spark-cassandra-connector), although this option is typically slower since it involves a network transfer at every lookup.
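For reference, a sketch of that explicit-store route using the connector's Java API (the keyspace, table, and Event bean are assumptions for illustration):

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// A serializable bean whose fields map to the columns of the (assumed) events table.
public static class Event implements java.io.Serializable {
    private String key;
    private Long value;
    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }
    public Long getValue() { return value; }
    public void setValue(Long value) { this.value = value; }
}

// events: a JavaDStream<Event>. Each micro-batch is written out explicitly;
// history is then available to later batches via reads against Cassandra.
events.foreachRDD(rdd -> {
    javaFunctions(rdd)
        .writerBuilder("my_keyspace", "events", mapToRow(Event.class))
        .saveToCassandra();
    return null; // Function<R, Void> in Spark 1.x
});
```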
Upvotes: 1