Misunderstanding of Spark RDD fault tolerance

Many say:

Spark does not replicate data in HDFS.

Spark arranges operations in a DAG and builds an RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph, so there is no need for data replication: the RDDs can be recomputed from the lineage graph.

And my question is:

If a node fails, Spark will only recompute the RDD partitions lost on this node, but where does the source data needed for the recomputation come from? Do you mean its parent RDD is still there when the node fails? What if the RDD that lost some partitions didn't have a parent RDD (for example, an RDD created by a Spark Streaming receiver)?

Upvotes: 5

Answers (2)

cabeer

Reputation: 115

If a node fails, Spark will only recompute the RDD partitions lost on this node, but where does the source data needed for the recomputation come from? Do you mean its parent RDD is still there when the node fails?

The core idea is that you can use the lineage to recover lost RDDs, because every RDD is either

built from another RDD, or
built from data in stable storage.

(source: RDD paper, beginning of section 2.1)

If an RDD is lost, you can walk back through the lineage until you reach an RDD, or the initial data, that is still available.

The data in stable storage is replicated across multiple nodes and is therefore unlikely to be lost.
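The recovery idea above can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `STABLE_STORAGE`, and the `cache` dictionary are hypothetical names invented for illustration, and the sketch assumes every transformation maps one parent partition to one child partition.

```python
from typing import Callable, Dict, List, Optional

# Stand-in for replicated stable storage (e.g. HDFS), keyed by partition index.
STABLE_STORAGE: Dict[int, List[int]] = {0: [1, 2, 3], 1: [4, 5, 6]}


class ToyRDD:
    """Hypothetical, simplified RDD: a parent pointer plus a per-partition function."""

    def __init__(
        self,
        parent: Optional["ToyRDD"],
        fn: Optional[Callable[[List[int]], List[int]]],
    ) -> None:
        self.parent = parent
        self.fn = fn
        self.cache: Dict[int, List[int]] = {}  # partitions currently held in memory

    def compute(self, idx: int) -> List[int]:
        # Use the in-memory copy if this partition survived.
        if idx in self.cache:
            return self.cache[idx]
        if self.parent is None:
            # End of the lineage: re-read the partition from stable storage.
            data = list(STABLE_STORAGE[idx])
        else:
            # Otherwise ask the parent for the same partition and re-apply fn.
            data = self.fn(self.parent.compute(idx))
        self.cache[idx] = data
        return data


source = ToyRDD(None, None)
doubled = ToyRDD(source, lambda part: [x * 2 for x in part])
result = ToyRDD(doubled, lambda part: [x + 1 for x in part])

before = result.compute(1)

# Simulate a node failure that drops partition 1 from every in-memory RDD.
for rdd in (source, doubled, result):
    rdd.cache.pop(1, None)

after = result.compute(1)  # rebuilt by walking the lineage back to stable storage
```

In this sketch only partition 1 is recomputed; partition 0, once computed, would stay in each `cache` and be reused as-is.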

From what I've read about Streaming receivers, the received data seems to be saved in stable storage as well, so it behaves just like any other data source.

Upvotes: 1

gsamaras

Reputation: 73444

What if we lose something part way through computation?

Rely on the key insight from MapReduce (MR)! Determinism provides safe recompute.
Track 'lineage' of each RDD. Can recompute from parents if needed.

Interesting: only need to record tiny state to do recompute.

Need parent pointer, function applied, and a few other bits.
Log 10 KB per transform rather than re-output 1 TB -> 2 TB

Source

The child RDD is metadata that describes how to calculate the RDD from the parent RDD. Read more in "What is RDD dependency in Spark?"
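To illustrate why a small lineage record plus determinism is enough, here is a minimal plain-Python sketch. It does not use Spark; `LineageRecord` and `replay` are hypothetical names, and the sketch assumes each recorded function is deterministic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage entry: a parent pointer and the function applied to it."""
    name: str
    parent: Optional["LineageRecord"]
    fn: Optional[Callable[[List[int]], List[int]]]


def replay(record: LineageRecord, source: List[int]) -> List[int]:
    """Rebuild a dataset by replaying the recorded functions from the source data."""
    if record.parent is None:
        return list(source)
    return record.fn(replay(record.parent, source))


source_data = list(range(100_000))

# Each record stores only metadata, never the transformed data itself.
base = LineageRecord("source", None, None)
squared = LineageRecord("squared", base, lambda xs: [x * x for x in xs])
evens = LineageRecord("evens", squared, lambda xs: [x for x in xs if x % 2 == 0])

first = replay(evens, source_data)
second = replay(evens, source_data)  # e.g. recomputed after the first copy was lost
```

Because the functions are deterministic, `second` is identical to `first`. If a function depended on random numbers or wall-clock time, replaying the lineage could produce a different dataset.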

Upvotes: 2
