Gaurav Dubey
Gaurav Dubey

Reputation: 305

What is Lineage In Spark?

How lineage helps to recompute data?

For example, I'm having several nodes computing data for 30 minutes each. If one fails after 15 minutes, can we recompute data processed in 15 minutes again using lineage without giving 15 minutes again?

Upvotes: 14

Views: 28299

Answers (5)

Manju Gowda
Manju Gowda

Reputation: 1

In conclusion, the Lineage Graph is a key concept in Apache Spark that represents the dependencies between RDDs or DataFrames in a Spark application. The Lineage Graph helps in fault tolerance by reconstructing lost RDDs based on their parent RDDs and their transformations.

Upvotes: 0

eliasah
eliasah

Reputation: 40370

Everything to understand about lineage is in the definition of RDD.

So let's review that :

RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure

So there is mainly 2 things to understand:

Unfortunately, these topics are quite long to discuss in a single answer. I recommend you take some time reading them along with this following article about Data Lineage.

And now to answer your question and doubts:

If an executor fails computing your data, after 15 minutes, it will go back to your last checkpoint, whether it's from the source or cache in memory and/or on disk.

Thus, it will not save you those 15 minutes that you have mentioned!

Upvotes: 23

KayV
KayV

Reputation: 13845

When a transformation (map or filter etc.) is called, it is not executed by Spark immediately, instead a lineage is created for each transformation. A lineage will keep track of what all transformations has to be applied on that RDD, including the location from where it has to read the data.

For example, consider the following example

val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()

sc.textFile() and myRdd.filter() do not get executed immediately, it will be executed only when an Action is called on the RDD - here filteredRdd.count().

An Action is used to either save result to some location or to display it. RDD lineage information can also be printed by using the command filteredRdd.toDebugString (filteredRdd is the RDD here). Also, DAG Visualization shows the complete graph in a very intuitive manner as follows:

enter image description here

Upvotes: 1

Fury Fazu
Fury Fazu

Reputation: 35

DEF: The Spark lineage graph is the set of dependencies between RDDs • Lineage graphs are maintained for each Spark application separately • The lineage graph is used to re computer RDDs on demand and to recover lost data if parts of a persisted RDD are lost • Note: be careful and do not confuse the lineage graph with the  Actions force the evaluation of all (upstream) transformations in the lineage graph of the RDD they are called on

Upvotes: 0

Spandana r
Spandana r

Reputation: 283

In Spark, Lineage Graph is a dependencies graph in between existing RDD and new RDD. It means that all the dependencies between the RDD will be recorded in a graph, rather than the original data.

Source: What is Lineage Graph

Upvotes: 0

Related Questions