Reputation: 31
I have a following User defined function in scala
val returnKey: UserDefinedFunction = udf((key: String) => {
val abc: String = key
abc
})
Now, I want to unit test whether it is returning correct or not. How do I write the Unit test for it. This is what I tried.
class CommonTest extends FunSuite with Matchers {
test("Invalid String Test") {
val key = "Test Key"
val returnedKey = returnKey(col(key));
returnedKey should equal (key);
}
But since its a UDF the returnKey is a UDF function. I am not sure how to call it or how to test this particular scenario.
Upvotes: 1
Views: 2673
Reputation: 652
A UserDefinedFunction
is effectively a wrapper around your Scala function that can be used to transform Column
expressions. In other words, the UDF given in the question wraps a function of String => String
to create a function of Column => Column
.
I usually pick 1 of 2 different approaches to testing UDFs.
// In your test
val testDF = Seq("Test Key", "", null).toDS().toDF("s")
val result = testDF.select(returnKey(col("s"))).as[String].collect.toSet
result should be(Set("Test Key", "", null))
Notice that this lets us test all our edge cases in a single spark plan. In this case, I have included tests for the empty string and null
.
def returnKeyImpl(key: String) = {
val abc: String = key
abc
}
val returnKey = udf(returnKeyImpl _)
Now we can test returnKeyImpl
by passing in strings and checking the string output.
There is a trade-off between these two approaches, and my recommendation is different depending on the situation.
If you are doing a larger test on bigger datasets, I would recommend using testing the UDF in a Spark job.
Testing the UDF in a Spark job can raise issues that you wouldn't catch by only testing the underlying Scala function. For example, if your underlying Scala function relies on a non-serializable object, then Spark will be unable to broadcast the UDF to the workers and you will get an exception.
On the other hand, starting spark jobs in every unit test for every UDF can be quite slow. If you are only doing a small unit test, it will likely be faster to just test the underlying Scala function.
Upvotes: 4