Reputation: 173
The organisation I am working for is moving from its traditional on-premises execution model to the public cloud. We have to pay for every execution that takes place on the cloud, so to reduce this execution cost we are doing two things:
As a big data engineer, my work mostly depends on Spark SQL, and I am trying to reduce SQL query execution time. What Catalyst does at execution time, I want to do before execution: for example, reading the logical plan, optimizing the logical plan, and generating the physical plan. I also want to add my own custom optimization rules to Catalyst, which would likewise be triggered at build time.
Is there any way to do all this before execution?
Upvotes: 2
Views: 546
Reputation: 1480
You can actually get the execution plan for your query by creating the DataFrame and not performing any action.
Suppose you have a DataFrame `df`; you can access its plans through `df.queryExecution` (e.g. `df.queryExecution.logical` for the parsed logical plan) and traverse them. This might cover your first requirement of avoiding bad executions, if you have some heuristic method to detect them.
As for custom optimizations, you can add your own optimization rules (see https://www.waitingforcode.com/apache-spark-sql/introduction-custom-optimization-apache-spark-sql/read). These are not triggered at build time but at execution time, like all Catalyst optimizations.
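
As a sketch of that approach: Spark exposes `spark.experimental.extraOptimizations` for injecting custom rules into the Catalyst optimizer. The rule below is a hypothetical no-op placeholder that only logs the plan; a real rule would return a rewritten plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule (assumed name): logs each plan it sees and returns it
// unchanged. Replace the body with your own rewrite logic.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizing plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Register the rule; Catalyst will now apply it when optimizing queries.
spark.experimental.extraOptimizations = Seq(LogPlanRule)
```

Note the rule still runs as part of query planning when the query is analyzed, not at your application's build time.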
Upvotes: 1