Reputation: 13845
I am learning Apache Spark and trying to get the lineage graph of the RDDs. But i could not find when does a particular lineage is created? Also, where to find the lineage of an RDD?
Upvotes: 1
Views: 1423
Reputation: 74679
RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD.
Note the part "logical" not "physical" that happens after you've executed an action.
Quoting Mastering Apache Spark 2 gitbook:
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
Any RDD has a RDD lineage even if that means that the RDD lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be a result of a series of transformations (and no transformations is a "zero-effect" transformation :))
You can check out the RDD lineage of an RDD using RDD.toDebugString:
toDebugString: String A description of this RDD and its recursive dependencies for debugging.
val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []
val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
+-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
| MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
Upvotes: 2