lserlohn

Reputation: 6206

How to take advantage of Spark 2.0 "whole-stage code generation"

I have been reading many articles about Spark 2.0 "whole-stage code generation". Since the technique optimizes the code at the compilation stage, I have several questions about it:

Q1. Can Python or R take advantage of this technique?

Q2. In Scala/Java, how do I take advantage of this technique? Do I have to express the whole query using Spark's API, or is a plain SQL string good enough? For example, can each of the following programs take advantage of whole-stage code generation?

case 1:

sparksession.sql("select * from a john b on a.id = b.id")

case 2:

val table_a = sparksession.sql("select * from a")
val table_b = sparksession.sql("select * from b")
val table_c = table_a.join(table_b, table_a(COL_ADID) === table_b(COL_ADID))

Q3. If case 1 in Q2 is able to utilize whole-stage code generation, what about reading the query string from an external file, like this:

val query = scala.io.Source.fromFile(queryfile).mkString
sparksession.sql(query)

In the above code, the compiler really doesn't know what the query string looks like at compile time. Can it still utilize the whole-stage code generation technique?

Upvotes: 1

Views: 578

Answers (1)

21d21d21d2d

Reputation: 26

  1. All languages that use the Spark SQL API can benefit from codegen as long as they don't use language-specific extensions (Python UDFs, dapply/gapply in R).

  2. Both the SQL and DataFrame APIs are supported, and the way you provide the query doesn't matter. Codegen is an internal process applied between user input and query execution. You can verify this yourself, as shown in the sketch below.
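To illustrate, here is a minimal, self-contained sketch (the tables a and b and their contents are made up for this example) that builds the same join from a SQL string and from the DataFrame API, then inspects the physical plan. In Spark 2.x, explain() prefixes operators that run inside whole-stage-generated code with '*', and debugCodegen() from org.apache.spark.sql.execution.debug prints the generated Java source:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-check")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two tiny illustrative tables (names and data are hypothetical).
Seq((1, "x"), (2, "y")).toDF("id", "a_val").createOrReplaceTempView("a")
Seq((1, "p"), (2, "q")).toDF("id", "b_val").createOrReplaceTempView("b")

// The same logical plan, built two ways.
val viaSql = spark.sql("select * from a join b on a.id = b.id")
val viaApi = spark.table("a").join(spark.table("b"), "id")

// Operators prefixed with '*' in the printed physical plan run
// inside a whole-stage-generated function.
viaSql.explain()
viaApi.explain()

// To dump the generated Java source itself:
import org.apache.spark.sql.execution.debug._
viaSql.debugCodegen()

Both explain() calls should show the same '*'-prefixed operators, which answers Q2: the SQL-string form and the DataFrame form compile to the same physical plan. It also answers Q3: since codegen happens when the plan is executed, not when your Scala program is compiled, a query string loaded from a file at runtime is optimized in exactly the same way.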

Upvotes: 1
