novice8989

Reputation: 169

Spark Scala UDF not returning expected value when the parameters are empty

I have a simple UDF that returns a value based on its input parameters. When the parameters are empty, it does not return the default case. I would appreciate any help in correcting my understanding.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val test = udf((a: Double, b: Double, c: Boolean) => {
  if (a >= 6 && !c) {
    "smith"
  } else if (a >= 20 && !c) {
    "Fred"
  } else if ((a < 6 || b < 2) && !c) {
    "Ross"
  } else {
    "NA"
  }
})
 
val ds1 = Seq(
  (1, "test", true),
  (2, "test2", false),
  (3, "teste", false)
).toDF("id", "name", "flag")

val ds2 = Seq(
  (2, 6, 4),
  (3, 0, 0)
).toDF("id", "flag2", "flag3")

var combined = ds1.as("n")
  .join(ds2.as("p"), $"n.id" === $"p.id", "left_outer")
  .select($"n.id", $"n.name", $"n.flag", $"flag2", $"flag3")

combined = combined.withColumn("newcol", test($"flag2", $"flag3", $"flag"))
combined.show(5, false)
  1. For the row with id = 1, the UDF should return "NA" because it meets none of the criteria in the UDF, but it returns null instead.

  2. Also, how can I populate empty/null values for the flag2 and flag3 columns in ds2? For example, I tried Seq(3,null.asInstanceOf[Double],null.asInstanceOf[Double]) and got an error.

Upvotes: 1

Views: 1046

Answers (2)

Ged

Reputation: 18023

For your understanding:

Scala's Double and Int map to Java primitives, and a Java primitive must always hold a value, i.e. null is not acceptable. For the row with id = 1, the left outer join leaves flag2 and flag3 null. Because the UDF declares those parameters as Double, Spark does not invoke it for that row and returns null instead. If you understand this, you should be able to devise a suitable solution.
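
One way to act on this is to declare the numeric parameters as boxed java.lang.Double, which can hold null, so the UDF is called and can handle the missing values itself. The following is only a sketch under stated assumptions: the object name NullAwareUdfSketch and the helper classify are hypothetical, and mapping a missing flag2 or flag3 to "NA" is an assumed policy, not something the question specifies. The sample data uses Option to express missing values, which is one way to put nulls into a Seq before calling toDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NullAwareUdfSketch {
  // Boxed java.lang.Double parameters may be null, so the null check can
  // live inside the function. Assumption: a missing value maps to "NA".
  // c stays a primitive Boolean because "flag" comes from the left side of
  // the join and is never null in the question's data.
  def classify(a: java.lang.Double, b: java.lang.Double, c: Boolean): String =
    if (a == null || b == null) "NA"
    else {
      val x = a.doubleValue
      val y = b.doubleValue
      // Branches copied from the question in their original order.
      if (x >= 6 && !c) "smith"
      else if (x >= 20 && !c) "Fred" // never reached: x >= 6 matches first
      else if ((x < 6 || y < 2) && !c) "Ross"
      else "NA"
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-aware-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val test = udf(classify _)

    // Option[Double] columns become nullable columns; None becomes null.
    val df = Seq[(Int, Option[Double], Option[Double], Boolean)](
      (1, None, None, true),
      (2, Some(6.0), Some(4.0), false),
      (3, None, None, false)
    ).toDF("id", "flag2", "flag3", "flag")

    df.withColumn("newcol", test($"flag2", $"flag3", $"flag")).show(false)
    spark.stop()
  }
}
```

With the null guard inside the function, rows whose flag2 or flag3 is null get a value chosen by the UDF rather than a null produced by Spark skipping the call.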

Upvotes: 1

Partha Deb

Reputation: 183

The UDF is not executed for rows whose inputs are null; Spark returns null for those rows instead. Handle the null values in the combined DataFrame before calling the UDF. One option is to replace the nulls with 0.

val new_combined = combined.na.fill(0).withColumn("newcol",test($"flag2",$"flag3",$"flag"))
new_combined.show(5,false)

+---+-----+-----+-----+-----+------+
|id |name |flag |flag2|flag3|newcol|
+---+-----+-----+-----+-----+------+
|1  |test |true |0    |0    |NA    |
|2  |test2|false|6    |4    |smith |
|3  |teste|false|0    |0    |Ross  |
+---+-----+-----+-----+-----+------+
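
Note that na.fill(0) replaces nulls in every numeric column. If that is too broad, the fill can be limited to the two flag columns, or the default can be applied only where the UDF is called so that the stored columns keep their nulls. This is a rough sketch; the object name FillSketch and the helpers fillFlags and orZero are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, lit}

object FillSketch {
  // Replace nulls with 0 only in flag2 and flag3; other columns are untouched.
  def fillFlags(df: DataFrame): DataFrame =
    df.na.fill(0L, Seq("flag2", "flag3"))

  // Use 0 in place of null for a single expression, leaving the column as is.
  def orZero(c: Column): Column = coalesce(c, lit(0))
}
```

For example, combined.withColumn("newcol", test(FillSketch.orZero($"flag2"), FillSketch.orZero($"flag3"), $"flag")) passes 0 to the UDF for missing values while flag2 and flag3 still show null in the output.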

https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html

Upvotes: 1
