Reputation: 6364
Apache Spark claims that it stores lineage instead of the RDDs themselves, so that it can recompute data after a failure. How does it store the lineage? For example, an RDD can be built from a bunch of user-provided transformation functions, so does it store the "source code" of those user-provided functions?
Upvotes: 1
Views: 628
Reputation: 330193
Simplifying things a little, RDDs are recursive data structures that describe lineages. Each RDD has a set of dependencies and is computed in a specific context. Functions passed to Spark actions and transformations are first-class objects: they can be stored, assigned, passed around, and captured as part of a closure, so there is no reason (not to mention no means) to store source code.
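To make the "recursive data structure" idea concrete, here is a minimal sketch in plain Python. It is not Spark code: `ToyRDD`, `from_list`, `lineage`, and `collect` are hypothetical names invented for illustration, and the real RDD classes are considerably more involved. The point is only that each node holds a reference to its parent and to a function object, so the result can be recomputed from the source at any time without any source code being stored.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: a node that knows its parent and how to compute."""

    def __init__(self, parent: Optional["ToyRDD"],
                 compute: Callable[[Iterator], Iterator], name: str):
        self.parent = parent      # dependency (None for a source node)
        self._compute = compute   # a function object, not source text
        self.name = name

    @staticmethod
    def from_list(data: Iterable) -> "ToyRDD":
        items = list(data)
        return ToyRDD(None, lambda _: iter(items), "source")

    def map(self, f: Callable) -> "ToyRDD":
        # The user function f is captured in a closure and kept by the new node.
        return ToyRDD(self, lambda it: (f(x) for x in it), "map")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self, lambda it: (x for x in it if pred(x)), "filter")

    def lineage(self) -> List[str]:
        # Walk the chain of dependencies back to the source.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def collect(self) -> list:
        # Nothing is cached: every call recomputes from the source via the lineage.
        upstream = self.parent.collect() if self.parent is not None else []
        return list(self._compute(iter(upstream)))


factor = 10
rdd = ToyRDD.from_list(range(5)).map(lambda x: x * factor).filter(lambda x: x > 10)
print(rdd.lineage())  # ['filter', 'map', 'source']
print(rdd.collect())  # [20, 30, 40]
```

Calling `collect()` a second time walks the same chain again, which is roughly what recomputation after a lost partition amounts to in this toy model.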
RDDs belong to the driver and are not equivalent to the data. By the time data is accessed on the workers, the RDDs are long gone and the only thing that matters is a given task.
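As a rough analogy for what a worker receives, the sketch below uses only the Python standard library to ship a compiled function as bytes and rebuild it elsewhere. It does not reproduce Spark's own closure serialization; `double`, `payload`, and `partition` are hypothetical names, and the "driver" and "worker" sides here are just two steps in the same process. It only illustrates that a function can travel as a serialized object with no source text involved.

```python
import marshal
import types


def double(x):
    return x * 2


# "Driver" side: serialize the compiled code object (bytecode), not source text.
payload = marshal.dumps(double.__code__)

# "Worker" side: rebuild a function from the bytes and apply it to one partition.
rebuilt = types.FunctionType(marshal.loads(payload), {})
partition = [1, 2, 3]
result = [rebuilt(x) for x in partition]
print(result)  # [2, 4, 6]
```

The rebuilt function knows nothing about the structure that produced it; it only has the code and the data it is handed, which mirrors the idea that on the worker only the task matters.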
Upvotes: 4