lserlohn

Reputation: 6206

How to take advantage of Spark 2.0 "whole-stage code generation"

I have been reading many articles about Spark 2.0 "whole-stage code generation". Since the technique optimizes the code at the compilation stage, I have several questions about it:

Q1. Can Python or R take advantage of this technique?

Q2. In Scala/Java, how do I take advantage of this technique? Do I have to express the whole query using Spark's API, or is a plain SQL string good enough? For example, can each of the following programs take advantage of whole-stage code generation?

case 1:

sparksession.sql("select * from a john b on a.id = b.id")

case 2:

val table_a = sparksession.sql("select * from a")
val table_b = sparksession.sql("select * from b")
val table_c = table_a.join(table_b, table_a(COL_ADID) === table_b(COL_ADID))

Q3. If case 1 in Q2 is able to utilize whole-stage code generation, what about reading the query string from an external file, like this:

val query = scala.io.Source.fromFile(queryfile).mkString
sparksession.sql(query)

In the above code, the compiler really doesn't know what the query string looks like at compile time. Can it still utilize the whole-stage code generation technique?

Upvotes: 1

Views: 578

Answers (1)

21d21d21d2d

Reputation: 26

  1. All languages that use the Spark SQL API can benefit from codegen as long as they don't use language-specific extensions (Python UDFs, dapply/gapply in R).

  2. Both the SQL and DataFrame APIs are supported, and the way you provide the query doesn't matter. Codegen is an internal process applied between user input and query execution. You can verify this yourself, as shown in the sketch below.
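To illustrate, here is a minimal, self-contained sketch (the tables a and b and their contents are made up for this example) that builds the same join from a SQL string and from the DataFrame API, then inspects the physical plan. In Spark 2.x, explain() prefixes operators that run inside whole-stage-generated code with '*', and debugCodegen() from org.apache.spark.sql.execution.debug prints the generated Java source:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two tiny illustrative tables (names and data are hypothetical).
Seq((1, "x"), (2, "y")).toDF("id", "a_val").createOrReplaceTempView("a")
Seq((1, "p"), (2, "q")).toDF("id", "b_val").createOrReplaceTempView("b")

// The same logical plan, built two ways.
val viaSql = spark.sql("select * from a join b on a.id = b.id")
val viaApi = spark.table("a").join(spark.table("b"), "id")

// Operators prefixed with '*' in the printed physical plan run
// inside a whole-stage-generated function.
viaSql.explain()
viaApi.explain()

// To dump the generated Java source itself:
import org.apache.spark.sql.execution.debug._
viaSql.debugCodegen()

Both explain() calls should show the same '*'-prefixed operators, which answers Q2: the SQL-string form and the DataFrame form compile to the same physical plan. It also answers Q3: since codegen happens when the plan is executed, not when your Scala program is compiled, a query string loaded from a file at runtime is optimized in exactly the same way.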

Upvotes: 1
