Reputation: 128
Is there a difference in performance between writing Spark applications with DataFrame method chains and writing them with Spark SQL? I know the method-based API is more flexible, but I'm not sure how the two compare on performance.
Example:
df.select(...).filter(...) // and so on
versus
spark.sql("<insert query here>")
Upvotes: 0
Views: 42
Reputation: 1416
There is no difference in performance between
df.select($"some_col").filter($"filter_col" === "somevalue")
and
spark.sql("select some_col from some_table where filter_col = 'somevalue'")
Both forms go through the same Catalyst optimizer, so the Spark plan generated for equivalent queries is the same in both cases. Which one to use is a matter of preference.
You can check the spark plan by running:
df.queryExecution.sparkPlan
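To see this for yourself, here is a minimal sketch that builds the same query both ways and prints the plans side by side. It assumes a local SparkSession and a tiny made-up dataset; the sample rows and the app name are illustrative and not from the question, while the table and column names simply mirror the example above. The filter is applied before the projection here only to keep the sketch unambiguous.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("plan-comparison") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, registered under the table name used above.
val df = Seq(("a", "somevalue"), ("b", "othervalue"))
  .toDF("some_col", "filter_col")
df.createOrReplaceTempView("some_table")

// The same query, once as a method chain and once as SQL.
val viaMethods = df.filter($"filter_col" === "somevalue").select($"some_col")
val viaSql = spark.sql(
  "select some_col from some_table where filter_col = 'somevalue'")

// Print the parsed, analyzed, optimized and physical plans for each.
viaMethods.explain(true)
viaSql.explain(true)

// Or inspect the physical plan objects directly.
println(viaMethods.queryExecution.sparkPlan)
println(viaSql.queryExecution.sparkPlan)
```

The parsed plans will look different, since one starts from SQL text and the other from method calls, but the optimized and physical plans are the ones to compare.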
Further reading on Spark plans:
https://dzone.com/articles/understanding-optimized-logical-plan-in-spark
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Upvotes: 1