user1870400

Reputation: 6364

How does Apache Spark store lineages?

Apache Spark claims that it stores lineages instead of the RDDs themselves, so that it can recompute them in case of a failure. How does it store the lineages? For example, an RDD can be built from a bunch of user-provided transformation functions, so does it store the source code of those user-provided functions?

Upvotes: 1

Views: 628

Answers (1)

zero323

Reputation: 330193

Simplifying things a little, RDDs are recursive data structures which describe lineages. Each RDD has a set of dependencies and is computed in a specific context. Functions passed to Spark actions and transformations are first-class objects: they can be stored, assigned, passed around, and captured as part of a closure, so there is no reason (not to mention no means) to store source code.
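To make the idea concrete, here is a rough toy model in plain Python. It is not Spark's actual implementation: `ToyRDD`, `build_pipeline`, and every method name below are hypothetical and exist only for illustration. Each node keeps a reference to its parent and the function object it was given, and the data is recomputed by walking that chain.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: stores a parent and a function, not results."""

    def __init__(self, parent=None, fn=None, source=None, name="source"):
        self.parent = parent  # the dependency (None for the root node)
        self.fn = fn          # function object applied to the parent's iterator
        self.source = source  # input data, only set on the root node
        self.name = name      # label used when printing the lineage

    def map(self, f):
        # Nothing is computed here; we only record the parent and the function.
        return ToyRDD(parent=self, fn=lambda it: (f(x) for x in it), name="map")

    def filter(self, p):
        return ToyRDD(parent=self, fn=lambda it: (x for x in it if p(x)), name="filter")

    def compute(self):
        # Recursive step: ask the parent for its data, then apply our function.
        if self.parent is None:
            return iter(self.source)
        return self.fn(self.parent.compute())

    def lineage(self):
        # Walk the parent chain from this node back to the root.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def collect(self):
        return list(self.compute())


def build_pipeline(factor):
    # `factor` is captured in the lambda's closure; no source code is stored.
    return (ToyRDD(source=range(5))
            .map(lambda x: x * factor)
            .filter(lambda x: x > 10))


rdd = build_pipeline(10)
print(rdd.lineage())  # ['filter', 'map', 'source']
print(rdd.collect())  # [20, 30, 40]
print(rdd.collect())  # same result, recomputed from the lineage
```

Calling `collect()` a second time recomputes everything from the root, which is the same idea as recovering lost data by replaying the lineage.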

RDDs belong to the Driver and are not equivalent to the data. When data is accessed on the workers, RDDs are long gone and the only thing that matters is a given task.

Upvotes: 4
