Reputation: 18098
The following code prompts a question on performance - imagine it, of course, at scale:
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._ // for toDF; already in scope in spark-shell
val df = sc.parallelize(Seq(
("r1", 1, 1),
("r2", 6, 4),
("r3", 4, 1),
("r4", 1, 2)
)).toDF("ID", "a", "b")
val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
// or
def ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", ones).explain
Below are the two physical plans, one for def and one for val - they are the same:
== Physical Plan == **def**
*(1) Project [_1#760 AS ID#764, _2#761 AS a#765, _3#762 AS b#766, (CASE WHEN (_2#761 = 1) THEN 1 ELSE 0 END + CASE WHEN (_3#762 = 1) THEN 1 ELSE 0 END) AS ones#770]
+- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#760, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#761, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#762]
+- Scan[obj#759]
== Physical Plan == **val**
*(1) Project [_1#780 AS ID#784, _2#781 AS a#785, _3#782 AS b#786, (CASE WHEN (_2#781 = 1) THEN 1 ELSE 0 END + CASE WHEN (_3#782 = 1) THEN 1 ELSE 0 END) AS ones#790]
+- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple3, true])._1, true, false) AS _1#780, assertnotnull(input[0, scala.Tuple3, true])._2 AS _2#781, assertnotnull(input[0, scala.Tuple3, true])._3 AS _3#782]
+- Scan[obj#779]
So, there is the discussion on val vs def performance.
Then:
I see no difference in the .explains. OK.
From elsewhere: val evaluates when defined, def - when called.
To the downvoter I ask this: the following is very clear, but val ones has more to it than the code below, and the code below is not iterated:
var x = 2 // using var as I need to change it to 3 later
val sq = x*x // evaluates right now
x = 3 // no effect! sq is already evaluated
println(sq)
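For contrast, a minimal sketch of the def case, which is re-evaluated on every call:
var y = 2
def sqDef = y * y // not evaluated here; evaluated afresh on each call
y = 3 // takes effect for later calls
println(sqDef) // prints 9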
Upvotes: 3
Views: 5531
Reputation: 149598
There are two core concepts at hand here: Spark DAG creation and evaluation, and Scala's val vs def definitions. These are orthogonal concerns.
I see no difference in the .explains
You see no difference because from Spark's perspective, the query is the same. It doesn't matter to the analyser if you store the graph in a val or create it each time with a def.
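A quick way to see this (a sketch reusing df, when and col from the question) is to build the expression both ways and compare the optimized plans, which match up to expression IDs:
val onesVal = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
def onesDef = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
// The analyser receives the same query either way
println(df.withColumn("ones", onesVal).queryExecution.optimizedPlan)
println(df.withColumn("ones", onesDef).queryExecution.optimizedPlan)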
From elsewhere: val evaluates when defined, def - when called.
This is Scala semantics. A val is an immutable reference which gets evaluated once, at the declaration site. A def stands for a method definition, and if you allocate a new DataFrame inside it, it will create one each time you call it. For example:
def ones =
df
.schema
.map(c => c.name)
.drop(1)
.map(x => when(col(x) === 1, 1).otherwise(0))
.reduce(_ + _)
val firstCall = ones
val secondCall = ones
The code above will build two separate DAGs over the DF.
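You can confirm that the two DAGs are distinct objects with reference equality (a sketch):
println(firstCall eq secondCall) // false: each call to the def built a new Column
val onesOnce = ones // evaluate the def once and keep the reference
println(onesOnce eq onesOnce) // true: a val always refers to the same instance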
I am assuming that it makes no difference whether a val or def is used here, as it is essentially within a loop and there is a reduce. Is this correct?
I'm not sure which loop you're talking about, but see my answer above for the distinction between the two.
Will df.schema.map(c => c.name).drop(1) be executed per dataframe row? There is of course no need. Does Catalyst optimize this?
No. That expression is plain Scala collection code which runs once on the driver when the Column expression is built: drop(1) here drops the first column name (ID) from the list of names, not a row of the data frame. Nothing in it is executed per row, so there is nothing for Catalyst to optimize away.
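To make that concrete (a sketch, assuming the df from the question):
val names = df.schema.map(c => c.name) // Seq("ID", "a", "b") -- plain driver-side Scala
val kept = names.drop(1) // Seq("a", "b"): the first name is dropped, not a row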
If the above is true, in that the statement is executed every time for the columns to process, how can we make that piece of code occur just once? Should we make a val of it, e.g. val ones = df.schema.map(c => c.name).drop(1)?
It does occur only once per data frame (and in your example there is exactly one).
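If you want that to be explicit in the code, hoisting the column list into its own val is a reasonable sketch (onesExpr is a hypothetical name):
val cols = df.schema.map(c => c.name).drop(1) // computed exactly once, on the driver
val onesExpr = cols.map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", onesExpr).explain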
Upvotes: 8
Reputation: 3470
The ones expression won't get evaluated per dataframe row; it will be evaluated once. A def gets evaluated per call. For example, if there are 3 dataframes using that ones expression, then the ones expression will be evaluated 3 times. The difference with a val is that the expression is only evaluated once.
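A sketch of that, assuming three hypothetical DataFrames df1, df2 and df3 which all have the columns a and b:
def ones = Seq("a", "b").map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
val r1 = df1.withColumn("ones", ones) // first evaluation: builds a Column
val r2 = df2.withColumn("ones", ones) // second evaluation: builds another
val r3 = df3.withColumn("ones", ones) // third evaluation: three instances in total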
Basically, the ones expression creates an instance of org.apache.spark.sql.Column whose value is (CASE WHEN (a = 1) THEN 1 ELSE 0 END + CASE WHEN (b = 1) THEN 1 ELSE 0 END). If the expression is a def, then a new org.apache.spark.sql.Column is instantiated every time it is called. If the expression is a val, then the same instance is used over and over.
Upvotes: 1