Raphael Roth

Reputation: 27383

Does Spark not check UDF / Column type?

Consider the following Spark 2.1 code:

val df = Seq("Raphael").toDF("name")
df.show()    

+-------+
|   name|
+-------+
|Raphael|
+-------+

val squareUDF = udf((d: Double) => Math.pow(d, 2))

df.select(squareUDF($"name")).show

+---------+
|UDF(name)|
+---------+
|     null|
+---------+

Why do I get null? I was expecting something like a ClassCastException, because I am trying to apply a Scala Double function to a String, and a direct cast fails:

"Raphael".asInstanceOf[Double]

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double

Upvotes: 2

Views: 873

Answers (1)

Alper t. Turker

Reputation: 35249

It is easy to figure out what is going on if you check the execution plan:

scala> df.select(squareUDF($"name")).explain(true)
== Parsed Logical Plan ==
'Project [UDF('name) AS UDF(name)#51]
+- AnalysisBarrier Project [value#36 AS name#38]

== Analyzed Logical Plan ==
UDF(name): double
Project [if (isnull(cast(name#38 as double))) null else UDF(cast(name#38 as double)) AS UDF(name)#51]
+- Project [value#36 AS name#38]
   +- LocalRelation [value#36]

== Optimized Logical Plan ==
LocalRelation [UDF(name)#51]

== Physical Plan ==
LocalTableScan [UDF(name)#51]

As you can see, Spark performs a type cast before applying the UDF:

UDF(cast(name#38 as double))

and SQL casts don't throw exceptions for type-compatible casts. If the actual cast is impossible, the value is undefined (NULL). If the types were incompatible:

Seq((1, ("Raphael", 42))).toDF("id", "name_number").select(squareUDF($"name_number"))
// org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(name_number)' due to data type mismatch: argument 1 requires double type, however, '`name_number`' is of struct<_1:string,_2:int> type.;;
// 
// at org.apache...

you'd get an exception.
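
You can also observe the null-producing cast directly, with no UDF involved (a minimal check, assuming the same df as above; the alias name is just for illustration):

df.select($"name".cast("double").alias("name_as_double")).show

+--------------+
|name_as_double|
+--------------+
|          null|
+--------------+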

The rest is covered by:

if (isnull(cast(name#38 as double))) null 

Since the value is null, the UDF is never called.
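
Conversely, when the cast succeeds, the UDF does run on the cast value (a minimal sketch reusing squareUDF from the question, with a string that parses as a double):

Seq("2.0").toDF("name").select(squareUDF($"name")).show

+---------+
|UDF(name)|
+---------+
|      4.0|
+---------+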

Upvotes: 3
