Reputation: 1023
Apache Spark brags that its operators (nodes) are "stateless". This allows Spark's architecture to use simpler protocols for things like recovery, load balancing, and handling stragglers.
On the other hand Apache Flink describes its operators as "stateful", and claim that statefulness is necessary for applications like machine learning. Yet Spark programs are able to pass forward information and maintain application data in RDDs without maintaining "state".
What is happening here? Is Spark not a true stateless system? Or is Flink's assertion that statefulness is essential for machine learning and similar application incorrect? Or is there some additional nuance here?
I don't feel like I truly grok the difference between "stateful" and "stateless" systems, and I would appreciate if they could be explained.
Upvotes: 6
Views: 4717
Reputation: 149538
The property of state refers to being able to access data from a previous point in time in the current point in time.
What does this mean? Assume I want to do a word count of all words which have arrived to my streaming application. But the nature of streaming is that data flows in and out of the pipeline. In order to be able to access previous data, in this example some kind of map which holds what was the previous number of words in the stream, I have to access some state which was accumulated.
While some of Sparks RDD operators are stateless, such as map
, filter
etc, it does expose stateful operators in the form of mapWithState
. Not only that, in the new Spark streaming architecture, called "Structured Streaming", state is built into the pipeline and mostly abstracted away from the user in order to be able to expose aggregation operators, such as agg
.
Upvotes: 5