Reputation: 5125
Many say:
Spark does not replicate data in HDFS.
Spark arranges the operations in a DAG. Spark builds an RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph. So there is no need for data replication, as the RDDs can be recalculated from the lineage graph.
And my question is:
If a node fails, Spark will only recompute the RDD partitions lost on this node, but where does the source data needed for the recomputation come from? Do you mean its parent RDD is still there when the node fails? What if the RDD that lost some partitions has no parent RDD (for example, an RDD created by a Spark Streaming receiver)?
Upvotes: 5
Views: 646
Reputation: 115
If a node fails, Spark will only recompute the RDD partitions lost on this node, but where does the source data needed for the recomputation come from? Do you mean its parent RDD is still there when the node fails?
The core idea is that you can use the lineage to recover lost RDDs because RDDs are
(source: RDD paper, beginning of section 2.1)
If some RDD is lost, you can walk back through the lineage until you reach an RDD, or the initial data in stable storage, that is still available, and recompute forward from there.
The data in stable storage is replicated across multiple nodes, therefore unlikely to be lost.
As far as I can tell from what I've read about Streaming Receivers, the received data seems to be saved in stable storage as well, so it behaves just like any other data source.
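The lineage walk described above can be sketched with a toy model in plain Python. This is not Spark's implementation: `ToyRDD`, `STABLE_STORAGE`, and the simulated node failure are all hypothetical names made up for illustration, and only narrow (one-to-one partition) dependencies are modeled.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for replicated stable storage (e.g. HDFS blocks),
# keyed by partition index. Assumed to survive the loss of a worker node.
STABLE_STORAGE: Dict[int, List[int]] = {0: [1, 2, 3], 1: [4, 5, 6]}


class ToyRDD:
    """Toy RDD: a parent pointer, a per-partition function, and a cache."""

    def __init__(self,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[List[int]], List[int]]] = None):
        self.parent = parent
        self.fn = fn
        # Partitions currently materialized on (simulated) worker nodes.
        self.cache: Dict[int, List[int]] = {}

    def compute(self, split: int) -> List[int]:
        # Use the partition if it is still available.
        if split in self.cache:
            return self.cache[split]
        if self.parent is None:
            # Root of the lineage: re-read the input from stable storage.
            data = list(STABLE_STORAGE[split])
        else:
            # Otherwise walk back to the parent and re-apply the function.
            data = self.fn(self.parent.compute(split))
        self.cache[split] = data
        return data


source = ToyRDD()
doubled = ToyRDD(source, lambda part: [x * 2 for x in part])
shifted = ToyRDD(doubled, lambda part: [x + 1 for x in part])

before = shifted.compute(1)

# Simulate losing the node that held partition 1 of every RDD in the chain.
for rdd in (source, doubled, shifted):
    rdd.cache.pop(1, None)

# Only partition 1 is recomputed, starting again from stable storage.
after = shifted.compute(1)
```

In this sketch the recomputation never touches partition 0, and the walk stops at the first ancestor whose partition is still cached; it only goes all the way back to stable storage when every intermediate copy is gone.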
Upvotes: 1
Reputation: 73366
What if we lose something part way through computation?
Interesting: only need to record tiny state to do recompute.
Need parent pointer, function applied, and a few other bits.
Log 10 KB per transform rather than re-output 1 TB -> 2 TB
The child RDD is metadata that describes how to calculate the RDD from the parent RDD. Read more in "What is RDD dependency in Spark?"
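A minimal sketch of the "child is just metadata" idea, in plain Python rather than Spark. `LineageNode`, `load`, `square`, and `materialize` are hypothetical names used only for illustration; the point is that defining a child records a parent pointer and a function, and no data is computed until someone asks for it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

calls = 0  # counts how often the transformation function actually runs


@dataclass(frozen=True)
class LineageNode:
    """All a child needs to remember: its parent and the function applied."""
    parent: Optional["LineageNode"]
    fn: Callable[[List[int]], List[int]]


def load(_: List[int]) -> List[int]:
    # Hypothetical stand-in for reading the input from stable storage.
    return [1, 2, 3]


def square(part: List[int]) -> List[int]:
    global calls
    calls += 1
    return [x * x for x in part]


def materialize(node: LineageNode) -> List[int]:
    # Compute the parent first (if any), then apply this node's function.
    upstream = materialize(node.parent) if node.parent else []
    return node.fn(upstream)


root = LineageNode(None, load)
child = LineageNode(root, square)

calls_after_define = calls    # still 0: defining the child computed nothing
result = materialize(child)   # the data only exists once it is asked for
```

The `LineageNode` record stays the same size no matter how large the data it describes is, which is the contrast the "10 KB per transform" note above is drawing.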
Upvotes: 2