Reputation: 18475
What is the meaning of the asterisk symbol at the beginning of a line in the logical plan?
See example below:
| == Physical Plan ==
*(2) HashAggregate(keys=[k#33], functions=[sum(cast(v#34 as bigint))])
+- Exchange hashpartitioning(k#33, 200), true, [id=#59]
+- *(1) HashAggregate(keys=[k#33], functions=[partial_sum(cast(v#34 as bigint))])
+- *(1) LocalTableScan [k#33, v#34] |
I remember that it has something to do with that operation being optimized but was not able to find something helpful in the docs and also do not know further details apart from "optimized".
Upvotes: 1
Views: 686
Reputation: 18475
Edit:
Finally found an online source documenting what the star means in the logical plan:
"Steps in the physical plan subject to whole stage code generation optimization, are prefixed by a star followed by the code generation id, for example: ‘*(1) LocalTableScan’."
Continued my research offline and found some valuable hints in the Spark Developer training material from Hortonworks and the book "Learning Spark, 2nd edition".
Operations with asterisk (*) use Whole-Stage Code Gen.
Whereas Whole-stage Code Generation is described as
"a physical query optimization phase that collapses the whole query into a single function, getting rid of virtual function calls and employing CPU registers for intermediate data."
A comprehensive descprition of Whole-Stage Code Gen is given in the Databricks article Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop where it says:
"The goal is to leverage whole-stage code generation so the engine can achieve the performance of hand-written code, yet provide the functionality of a general purpose engine. Rather than relying on operators for processing data at runtime, these operators together generate code at runtime and collapse each fragment of the query, where possible, into a single function and execute that generated code instead."
Upvotes: 2