Reputation: 109
I am new to PySpark and am trying to understand the exact use of toDebugString(). Can you please explain it using the code snippet below?
>>> a = sc.parallelize([1,2,3]).distinct()
>>> print a.toDebugString()
(8) PythonRDD[27] at RDD at PythonRDD.scala:44 [Serialized 1x Replicated]
| MappedRDD[26] at values at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
| ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
+-(8) PairwiseRDD[24] at distinct at <stdin>:1 [Serialized 1x Replicated]
| PythonRDD[23] at distinct at <stdin>:1 [Serialized 1x Replicated]
| ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:358 [Serialized 1x Replicated]
Upvotes: 5
Views: 4222
Reputation: 31
In Spark, the dependencies among RDDs are recorded as a graph; in simpler words, every step is part of the lineage. By calling the toDebugString method you are essentially asking for this lineage graph (the chain of every individual step that happened, i.e. the type of each RDD created and the method used to create it) to be displayed. Some of these steps are transformations you executed explicitly, whereas others are not (for example, the bottom-most step of the lineage graph is the RDD holding the data you actually fed in, but just above it is an RDD created by Spark's internal machinery to convert the objects in the input RDD to Java-type objects).
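To see this in a minimal form, here is a sketch assuming a local SparkContext (the app name "lineage-demo" is just for illustration); note that in Python 3 versions of PySpark, toDebugString() returns bytes, so it is decoded before printing:

from pyspark import SparkContext

sc = SparkContext("local", "lineage-demo")  # illustrative local context

# map() is a narrow transformation: it adds a step to the lineage but no
# shuffle, so the whole debug string stays at a single indentation level.
narrow = sc.parallelize([1, 2, 3]).map(lambda x: x * 2)
print(narrow.toDebugString().decode())  # returns bytes in Python 3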
The result of your print statement shows every step bottom-up, starting with the creation of the ParallelCollectionRDD. Each change in indentation indicates a shuffle boundary, i.e. the occurrence of a shuffle operation. You can read more about lineage graphs for a better understanding.
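As a contrast to the narrow example above, here is a hedged sketch (reusing the same sc) where two wide transformations each introduce a shuffle, so the debug string gains a new indentation level, marked with a +-(n) prefix, for each one:

# distinct() and reduceByKey() are wide transformations: each one forces a
# shuffle, and each shuffle shows up as a new indentation level in the output.
pairs = (sc.parallelize([1, 2, 2, 3])
           .distinct()                        # first shuffle boundary
           .map(lambda x: (x % 2, x))         # narrow step, same level
           .reduceByKey(lambda a, b: a + b))  # second shuffle boundary
print(pairs.toDebugString().decode())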
Upvotes: 3