Reputation: 6454
I have a Dataset like this:
+----+------+
|code|status|
+-----------+
| 1| "new"|
| 2| null|
| 3| null|
+----+------+
I would like to write a UDF that depends on both columns.
I got it working following the second approach in this answer which is to handle null
outside the UDF, and write myFn
to take a Boolean as the second parameter:
df.withColumn("new_column",
when(pst_regs("status").isNull,
myFnUdf($"code", lit(false))
)
.otherwise(
myFnUdf($"code", lit(true))
)
)
To handle null in the UDF an approach I looked at is per this answer which talks about "wrapping arguments with Options
". I tried code like this:
df.withColumn("new_column", myFnUdf($"code", $"status"))
def myFn(code: Int, status: String) = (code, Option(status)) match {
case (1, "new") => "1_with_new_status"
case (2, Some(_)) => "2_with_any_status"
case (3, None) => "3_no_status"
}
But a row with null
gives type mismatch; found :None.type required String
. I also tried wrapping an argument with Option
during udf creation without success. The basic form of this (without Option) looks like this:
myFnUdf = udf[String, Int, String](myFn(_:Int, _:String))
I'm new to Scala so I'm sure I'm missing something simple. Part of my confusion may be the different syntaxes for creating udfs from functions (e.g. per https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html), so I'm not sure I'm using the best way. Any help appreciated!
EDIT
Edited to add missing (1, "new")
case per @user6910411 and @sgvd comments.
Upvotes: 7
Views: 5962
Reputation: 3939
Firstly, there may be some code you are using that we are missing here. When I try your example myFn
, made into a UDF with val myFnUdf = udf(myFn _)
and run it with df.withColumn("new_column", myFnUdf($"code", $"status")).show
, I don't get a type mismatch, but instead a MatchError
, as also noted by user6910411. This is becausethere is no pattern to match (1, "new")
.
Aside from that, although it is usually better to use Scala's Options rather than raw null
values, you don't have to in this case. The following example works with null
directly:
val my_udf = udf((code: Int, status: String) => status match {
case null => "no status"
case _ => "with status"
})
df.withColumn("new_column", my_udf($"code", $"status")).show
Result:
+----+------+-----------+
|code|status| new_column|
+----+------+-----------+
| 1| new|with status|
| 2| null| no status|
| 2| null| no status|
+----+------+-----------+
Wrapping with options does still work though:
val my_udf = udf((code: Int, status: String) => Option(status) match {
case None => "no status"
case Some(_) => "with status"
})
This gives the same result.
Upvotes: 9