Reputation: 63
I am working on a requirement to display a real-time dashboard based on some aggregations computed over the input data.
I have just started to explore Spark/Spark Streaming, and I see we can compute the aggregations in real time in micro-batches using Spark Streaming and provide the results to the UI dashboard.
My query is: if at any time after the Spark Streaming job is started it is stopped or crashes, how will it resume from the position it was last processing when it comes back up? I understand Spark maintains an internal state that we update for every new piece of data we receive, but wouldn't that state be gone when it is restarted?
I feel we may have to periodically persist the running total/result so that Spark can resume processing by fetching it from there when it is restarted, but I am not sure how to do that with Spark Streaming.
I am also not sure whether Spark Streaming by default ensures that data is not lost, as I have just started using it.
If anyone has faced a similar scenario, can you please share your thoughts on how I can address this?
Upvotes: 0
Views: 273
Reputation: 670
If your input comes through receivers, first enable the write-ahead log so that received data survives a driver failure:
spark.streaming.receiver.writeAheadLog.enable true
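A minimal sketch of setting this flag programmatically (you could equally fold it into the SparkConf built in createContext below, or pass it to spark-submit with --conf; the app name is a placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("dashboard.app") // placeholder name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")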
Checkpointing writes your application state to reliable storage (such as HDFS) periodically, and when your application fails it can recover from the checkpoint files. To write checkpoints, call:
ssc.checkpoint("checkpoint.path")
To recover from the checkpoint on restart:
import org.apache.spark.streaming.StreamingContext

def main(args: Array[String]): Unit = {
  // Rebuild the context from checkpoint data if it exists,
  // otherwise create a fresh one via createContext()
  val ssc = StreamingContext.getOrCreate("checkpoint.path", () => createContext())
  ssc.start()
  ssc.awaitTermination()
}
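Note that getOrCreate invokes createContext() only when no checkpoint data exists at the given path; otherwise it rebuilds the StreamingContext, including your DStream operations and their state, from the checkpoint. That is why the path passed here must match the one you give to ssc.checkpoint.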
In the createContext function, you should create the StreamingContext and set up your own processing logic. For example:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("app.name")
    // finish processing the in-flight batch before shutting down
    .set("spark.streaming.stopGracefullyOnShutdown", "true")
  // Seconds(...) takes a number; 10 is a placeholder batch interval
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("checkpoint.path")
  // your code here
  ssc
}
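This also covers the running-total concern in the question: once checkpointing is enabled, stateful operations such as updateStateByKey persist their state to the checkpoint directory and restore it on restart. A minimal sketch, assuming a DStream of (key, count) pairs (the function name and input are placeholders):

import org.apache.spark.streaming.dstream.DStream

def runningTotals(events: DStream[(String, Long)]): DStream[(String, Long)] =
  // Merge each micro-batch's new values into the previously checkpointed total
  events.updateStateByKey { (newValues: Seq[Long], total: Option[Long]) =>
    Some(total.getOrElse(0L) + newValues.sum)
  }

Because the state lives in the checkpoint directory, the running totals survive a restart as long as the context is recreated via getOrCreate as shown above.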
Here is the documentation on the steps needed to deploy Spark Streaming applications, including recovering from driver/executor failures:
https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#deploying-applications
Upvotes: 1
Reputation: 1982
Spark Streaming acts as a consumer application. In real time, data is pulled from Kafka topics, and you can store the offsets of the consumed data in some data store; the same is true if you are reading data from Twitter streams. You can follow the posts below to see how to store the offsets so the application can resume from them after a crash or restart.
http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
https://www.linkedin.com/pulse/achieving-exactly-once-semantics-kafka-application-ishan-kumar
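A rough sketch of the direct-stream pattern those posts describe, assuming the spark-streaming-kafka 0.8 artifact that matches the Spark 1.6 docs linked above (the broker address, topic name, and the offset-persistence step are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

def kafkaWithOffsets(ssc: StreamingContext): Unit = {
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events")) // placeholder topic

  stream.foreachRDD { rdd =>
    // The direct stream exposes the exact Kafka offsets backing this batch;
    // this cast must happen on the stream's RDD before any transformation
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Compute your aggregates, then persist offsetRanges to your data store,
    // ideally in the same transaction, so a restart can resume from them
    offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}"))
  }
}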
Upvotes: 0