Georg Heiler

Reputation: 17676

Dataframe null values transformed to 0 after UDF. Why?

How can nulls be handled when accessing DataFrame Row values inside a UDF? Does the NullPointerException really have to be handled manually? There must be a better solution.

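import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, collect_list, struct, udf}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

case class FirstThing(id: Int, thing: String, other: Option[Double])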

val df = Seq(FirstThing(1, "first", None), FirstThing(1, "second", Some(2)), FirstThing(1, "third", Some(3))).toDS
df.show

val list = df.groupBy("id").agg(collect_list(struct("thing", "other")).alias("mylist"))
list.show(false)

This fails with NPE:

val xxxx = udf((t:Seq[Row])=> t.map(elem => elem.getDouble(1)))
list.withColumn("aa", xxxx(col("mylist"))).show(false)

This strangely gives 0:

val xxxx = udf((t:Seq[Row])=> t.map(elem => elem.getAs[Double]("other")))
list.withColumn("aa", xxxx(col("mylist"))).show(false)

+---+-----------------------------------------+---------------+
|id |mylist                                   |aa             |
+---+-----------------------------------------+---------------+
|1  |[[first,null], [second,2.0], [third,3.0]]|[0.0, 2.0, 3.0]|
+---+-----------------------------------------+---------------+

Sadly, this approach, which works fine with DataFrames/Datasets, fails as well:

val xxxx = udf((t:Seq[Row])=> t.map(elem => elem.getAs[Option[Double]]("other")))
list.withColumn("aa", xxxx(col("mylist"))).show(false)

ClassCastException: java.lang.Double cannot be cast to scala.Option

Upvotes: 3

Views: 344

Answers (1)

Shaido

Reputation: 28322

Using getAs[Double] and wrapping the result in an Option gives the expected result:

val xxxx = udf((t: Seq[Row])=> t.map(elem => Option(elem.getAs[Double]("other"))))
list.withColumn("aa", xxxx($"mylist")).show(false)

+---+-----------------------------------------+----------------+
|id |mylist                                   |aa              |
+---+-----------------------------------------+----------------+
|1  |[[first,null], [second,2.0], [third,3.0]]|[null, 2.0, 3.0]|
+---+-----------------------------------------+----------------+
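If an explicit null check reads more clearly than wrapping in Option, Row.isNullAt can be used before reading the value. The following is only a sketch, assuming the struct layout from the question (the "other" field at index 1, as in the getDouble(1) attempt); the names otherOrNone and nullSafe are hypothetical.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical helper: reads the "other" field (index 1) of one struct
// element and returns None when it is null, so getDouble is never
// called on a null value.
def otherOrNone(elem: Row): Option[Double] =
  if (elem.isNullAt(1)) None else Some(elem.getDouble(1))

// UDF that applies the null-safe read to every struct in the array.
val nullSafe = udf((t: Seq[Row]) => t.map(otherOrNone))
```

It would be applied the same way as the UDFs above, for example list.withColumn("aa", nullSafe(col("mylist"))).show(false).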

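The reason that getAs[Option[Double]] does not work is probably that the DataFrame schema keeps no record that the field was an Option. It stores a plain nullable double, so the Row holds a java.lang.Double (or null) rather than a scala.Option, which matches the ClassCastException in the question. Schema before the UDF: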

root
 |-- id: integer (nullable = false)
 |-- mylist: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- thing: string (nullable = true)
 |    |    |-- other: double (nullable = true)
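As for why the null turned into 0: the effect can be reproduced without Spark. The sketch below uses a hypothetical stand-in for Row.getAs (a generic cast, which is an assumption about how it behaves, consistent with the results above). When the result is used as a primitive Double, Scala unboxes it and null becomes 0.0; when it is passed directly to Option.apply, it stays boxed and the null is still visible.

```scala
object NullUnboxingDemo {
  // Hypothetical stand-in for Row.getAs[T]: casts an untyped value to T.
  // T is erased at runtime, so the cast itself never fails for null.
  def getAs[T](value: Any): T = value.asInstanceOf[T]

  def main(args: Array[String]): Unit = {
    val stored: Any = null // a missing "other" value as held in a row

    // Assigning to a primitive Double forces unboxing; null unboxes to 0.0.
    val unboxed: Double = getAs[Double](stored)
    println(unboxed) // prints 0.0

    // Passed straight to Option.apply, the value is not unboxed first,
    // so the null is turned into None (behaviour seen on Scala 2.x).
    val wrapped: Option[Double] = Option(getAs[Double](stored))
    println(wrapped)
  }
}
```

This is why wrapping the getAs[Double] call in Option inside the UDF yields null in the output column instead of 0.0.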

Upvotes: 2
