tsar2512

Reputation: 2994

Iterative programming on an ordered Spark stream using Java in Spark Streaming?

Is there any way in Spark Streaming to keep data across multiple micro-batches of a sorted DStream, where the stream is sorted by timestamps (assuming monotonically arriving data)? Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?

What does iteration mean?

I first sort the DStream by timestamp, assuming that the data arrives with monotonically increasing timestamps (no out-of-order data).

I need a global HashMap X, which I would like to update using values with timestamp "t1", and then subsequently with "t1+1". Since the state of X itself impacts the calculations, the processing has to be sequential: the operation at "t1+1" depends on HashMap X, which in turn depends on the data at and before "t1".
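To make the intended semantics concrete, here is roughly what I am after in plain, single-machine Java (Event and the per-key increment are placeholders for my actual types and update logic):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event type: a key plus a timestamp.
class Event {
    String key;
    long timestamp;
}

class StatefulProcessor {
    // The global state X I want to carry across micro-batches.
    private final Map<String, Long> X = new HashMap<String, Long>();

    // One "iteration" = one micro-batch, processed in timestamp order.
    void step(List<Event> batchAtT) {
        for (Event e : batchAtT) {
            Long prev = X.containsKey(e.key) ? X.get(e.key) : 0L;
            // Stand-in update: X at t+1 is a function of X at t and the
            // data at t. My real update is more involved, but it is
            // sequential in exactly this way.
            X.put(e.key, prev + 1L);
        }
    }
}
```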

Application

This is especially the case when one is trying to update a model, compare two sets of RDDs, or keep a global history of certain events, etc., which will impact operations in future iterations.

I would like to keep some accumulated history for the calculations: not the entire dataset, but certain persisted events that can be used in future DStream RDDs.

Upvotes: 1

Views: 704

Answers (1)

Svend

Reputation: 7180

updateStateByKey does precisely that: it lets you define some state, as well as a function to update it based on each RDD in your stream. This is the typical way to accumulate historical computations over time.

From the doc:

The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.

  1. Define the state - The state can be of arbitrary data type.
  2. Define the state update function - Specify with a function how to update the state using the previous state and the new values from input stream.

More info here: https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#updatestatebykey-operation
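Since the question is about the Java API, here is a minimal sketch against the Spark 1.4 Java API. The running per-key count is just a stand-in for whatever state you actually accumulate, and events and jssc are hypothetical names for your own pair DStream and streaming context:

```java
import java.util.List;

import com.google.common.base.Optional;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Update function: combines this batch's new values for a key with the
// previously accumulated state for that key. Returning Optional.absent()
// would drop the key from the state.
Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
        @Override
        public Optional<Integer> call(List<Integer> newValues, Optional<Integer> state) {
            int sum = state.or(0);      // previous state, or 0 on first sight of the key
            for (Integer v : newValues) {
                sum += v;               // fold this micro-batch's values into the state
            }
            return Optional.of(sum);    // becomes the state seen by the next batch
        }
    };

// events is a JavaPairDStream<String, Integer>. Stateful transformations
// require checkpointing to be enabled on the streaming context, e.g.
// jssc.checkpoint("hdfs://...").
JavaPairDStream<String, Integer> runningTotals = events.updateStateByKey(updateFunction);
```

Note that the state is maintained per key, so if you need a single global structure like the HashMap in the question, you can key everything by a constant or keep one state entry per logical key of that map.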

If this does not cut it or you need more flexibility, you can always store the state explicitly in a key-value store such as Cassandra (cf. the Cassandra connector: https://github.com/datastax/spark-cassandra-connector), although this option is typically slower, since it systematically involves a network transfer at every lookup.
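For illustration, a sketch of the explicit-store route using the connector's Java API; Event is a hypothetical bean whose fields map to the columns of a pre-existing my_keyspace.event_history table, and events stands in for your own JavaDStream<Event>:

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Persist each micro-batch so that later batches (or other jobs) can read
// the accumulated history back via javaFunctions(sc).cassandraTable(...).
events.foreachRDD(new Function<JavaRDD<Event>, Void>() {
    @Override
    public Void call(JavaRDD<Event> rdd) {
        javaFunctions(rdd)
            .writerBuilder("my_keyspace", "event_history", mapToRow(Event.class))
            .saveToCassandra();
        return null;
    }
});
```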

Upvotes: 1
