javamonkey79

Reputation: 17775

spark sql where clause after select

Consider the following Spark SQL query:

  Seq(("b", 2), ("d", 4), ("a", 1), ("c", 3))
  .toDF("letter", "number")
  .select($"letter")
  .where($"number" > 1)
  .show

The original query can even be split up, and the behavior remains the same:

val letters =
  Seq(("b", 2), ("d", 4), ("a", 1), ("c", 3))
    .toDF("letter", "number")
    .select($"letter")

letters
  .where($"number" > 1)
  .show

This looks to be related to lazy evaluation, but I'm not sure exactly what is happening here.

Why is it possible to include $"number" in the where clause, when only letter should remain?

EDIT 1: Here is the explain output:

letters.explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias('letter, None)]
+- Project [_1#76942 AS letter#76955, _2#76943 AS number#76956]
   +- LocalRelation [_1#76942, _2#76943]

== Analyzed Logical Plan ==
letter: string
Project [letter#76955]
+- Project [_1#76942 AS letter#76955, _2#76943 AS number#76956]
   +- LocalRelation [_1#76942, _2#76943]

== Optimized Logical Plan ==
LocalRelation [letter#76955]

== Physical Plan ==
LocalTableScan [letter#76955]

Upvotes: 2

Views: 480

Answers (1)

Ged

Reputation: 18053

This is inherent to the Spark approach: within an Action / Job, within a Stage, code is fused across narrow transformations.

Spark will optimize the code; there are many examples of this. For instance (filled in with illustrative values, assuming an existing SparkContext named sc):

val rdd1 = sc.parallelize(1 to 10)  // source RDD
val rdd2 = rdd1.map(_ + 1)          // narrow transformation
val rdd3 = rdd2.map(_ * 2)          // narrow transformation

When the Action occurs, there may well never be a materialized rdd1 or rdd2 at all, due to Spark optimizing and fusing the code in this trivial example.
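
For instance, a minimal sketch of what effectively executes for the chain above (assuming the rdd definitions just given): both map functions run in a single pass per partition, as if you had written one fused map.

// Effectively what executes for the rdd1 -> rdd2 -> rdd3 chain above:
// one pass per partition, no intermediate RDD materialized.
val fused = sc.parallelize(1 to 10).map(x => (x + 1) * 2)
assert(rdd3.collect().sameElements(fused.collect()))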

In your case, Spark can fuse it all together, and you just get a simple LocalTableScan.
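
You can see this directly; a sketch assuming the question's letters DataFrame and a SparkSession with implicits in scope (attribute IDs and exact output vary by Spark version):

// The predicate on number still resolves, and the optimizer evaluates the
// Project/Filter chain over the local data at planning time, so the
// optimized plan collapses to a bare LocalRelation, as in the explain above.
println(letters.where($"number" > 1).queryExecution.optimizedPlan)
// LocalRelation [letter#...]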

You can look at https://spoddutur.github.io/spark-notes/second_generation_tungsten_engine.html for an idea of code fusing, aka Whole-Stage Code Generation.
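
As a minimal sketch of what whole-stage code generation looks like in an explain output, run against a non-local source (a LocalRelation, as above, is optimized away entirely); the session name spark is assumed, and exact output varies by Spark version:

import spark.implicits._

// The *(1) markers denote operators fused into a single generated function.
spark.range(10).filter($"id" > 1).select(($"id" * 2).as("doubled")).explain()
// == Physical Plan ==
// *(1) Project [(id#0L * 2) AS doubled#4L]
// +- *(1) Filter (id#0L > 1)
//    +- *(1) Range (0, 10, step=1, splits=...)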

Upvotes: 2
