KayV
KayV

Reputation: 13845

When does a RDD lineage is created? How to find lineage graph?

I am learning Apache Spark and trying to get the lineage graph of the RDDs. But i could not find when does a particular lineage is created? Also, where to find the lineage of an RDD?

Upvotes: 1

Views: 1423

Answers (1)

Jacek Laskowski
Jacek Laskowski

Reputation: 74679

RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD.

Note the part "logical" not "physical" that happens after you've executed an action.

Quoting Mastering Apache Spark 2 gitbook:

RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.

A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.

Any RDD has a RDD lineage even if that means that the RDD lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be a result of a series of transformations (and no transformations is a "zero-effect" transformation :))

You can check out the RDD lineage of an RDD using RDD.toDebugString:

toDebugString: String A description of this RDD and its recursive dependencies for debugging.

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []

val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

Upvotes: 2

Related Questions