Reputation: 74679
I've been exploring the whole-stage code generation optimization in Spark SQL (aka whole-stage codegen) and been wondering how much "stage" in "whole-stage" is from Spark Core's meaning of a stage (of a Spark job)?
Is there any technical relationship between the stages in whole-stage codegen in Spark SQL and Spark Core? Or are they used more broadly to refer to a "phase" in a computation?
Upvotes: 3
Views: 3382
Reputation: 553
The concepts are quite similar but not always the same.
In Spark Core, a stage corresponds to a group of operators within a shuffle boundary.
The explain() output of the expression below shows whole-stage code generation at work. When an operator has an asterisk (*) next to it, whole-stage code generation is enabled for that operator. In the following case, Range, Filter, and the two Aggregates all run with whole-stage code generation. Exchange, however, does not implement whole-stage code generation because it sends data across the network.
spark.range(1000).filter("id > 100").selectExpr("sum(id)").explain()
== Physical Plan ==
*Aggregate(functions=[sum(id#201L)])
+- Exchange SinglePartition, None
   +- *Aggregate(functions=[sum(id#201L)])
      +- *Filter (id#201L > 100)
         +- *Range 0, 1, 3, 1000, [id#201L]
In the case of whole-stage codegen, the CollapseCodegenStages physical preparation rule finds the plans that support codegen and collapses them together into a WholeStageCodegen node.
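As a rough illustration (a simplified toy model in plain Python, not Spark's actual classes or rule implementation), the collapsing pass can be pictured as walking the operators of a plan and grouping consecutive codegen-capable ones, breaking at operators like Exchange that don't support codegen:

```python
# Toy model of the CollapseCodegenStages idea (illustration only; the
# names and flat-list plan shape here are simplifications, not Spark's API).

def collapse_codegen_stages(plan):
    """Group consecutive codegen-capable operators into WholeStageCodegen
    groups, breaking at operators (like Exchange) that don't support codegen."""
    groups, current = [], []
    for op_name, supports_codegen in plan:
        if supports_codegen:
            current.append(op_name)
        else:
            if current:
                groups.append(("WholeStageCodegen", current))
                current = []
            groups.append((op_name, None))
    if current:
        groups.append(("WholeStageCodegen", current))
    return groups

# Mirrors the physical plan above, listed from sink to source:
plan = [
    ("Aggregate", True),
    ("Exchange", False),   # shuffle boundary: no codegen
    ("Aggregate", True),
    ("Filter", True),
    ("Range", True),
]
print(collapse_codegen_stages(plan))
# → [('WholeStageCodegen', ['Aggregate']), ('Exchange', None),
#    ('WholeStageCodegen', ['Aggregate', 'Filter', 'Range'])]
```

Note how the two starred regions of the plan above become two separate WholeStageCodegen groups, split exactly at the Exchange.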
See the following links for a better idea:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-CollapseCodegenStages.html
Upvotes: 6
Reputation: 141
The term (and concept) of "stage" is the same in RDD execution, SQL/DataFrame execution, and whole-stage codegen.
A stage comprises all of the narrow (map-side) operations from a read (from an external source or a previous shuffle) to the subsequent write (to the next shuffle or a final output location such as a filesystem or database).
With whole-stage codegen, when possible, each physical operator produces some code, and the pieces are fused together (snapping together like Lego bricks, following a common pattern) into one big Java function that gets compiled; call it "f".
Execution then proceeds (roughly) by taking an RDD[InternalRow] with the needed fields/columns and calling rdd.mapPartitions(f).
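To make the fusion idea concrete, here is a plain-Python analogy (not Spark internals) of the example query above: the per-row logic of the Filter and the partial Aggregate are inlined into one function that runs once per partition, with no intermediate collections between operators:

```python
# Plain-Python analogy of whole-stage codegen (illustration only): the
# per-row work of Filter(id > 100) and a partial sum(id) are fused into
# one function "f", applied per partition like rdd.mapPartitions(f).

def fused_partition_fn(rows):
    """Fused filter + partial aggregate, one pass over a partition."""
    total = 0
    for row_id in rows:
        if row_id > 100:      # Filter, inlined
            total += row_id   # partial Aggregate, inlined
    yield total

# Pretend spark.range(1000) is split into two partitions:
partitions = [range(0, 500), range(500, 1000)]
partial_sums = [s for part in partitions for s in fused_partition_fn(part)]

# The final Aggregate after the Exchange combines the partial sums:
print(sum(partial_sums))  # → 494450
```

This is why the Exchange sits outside the codegen regions: the fused function runs within a partition, while combining the partial sums requires moving data between partitions.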
Another way to see it is in the SQL tab of the Spark UI: when "full" whole-stage codegen is achieved, the blue outer WholeStageCodegen boxes cover everything except the Exchange boxes (the physical shuffle).
Upvotes: 9