Satish Dalal

Reputation: 31

How do we write a unit test for a UDF in Scala?

I have the following user-defined function in Scala:

val returnKey: UserDefinedFunction = udf((key: String) => {
    val abc: String = key
    abc
})

Now I want to unit test whether it returns the correct value. How do I write a unit test for it? This is what I tried:

class CommonTest extends FunSuite with Matchers {
    test("Invalid String Test") {
        val key = "Test Key"
        val returnedKey = returnKey(col(key));
        returnedKey should equal (key);
    }
}

But since returnKey is a UDF, I am not sure how to call it or how to test this particular scenario.

Upvotes: 1

Views: 2673

Answers (1)

Erp12

Reputation: 652

A UserDefinedFunction is effectively a wrapper around your Scala function that can be used to transform Column expressions. In other words, the UDF given in the question wraps a function of String => String to create a function of Column => Column.
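To make this concrete, here is a minimal sketch (the value names are illustrative) of what that means at the type level, and why comparing the result of returnKey directly to a String fails:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// returnKey maps Column expressions to Column expressions
val resultColumn: Column = returnKey(col("key"))

// returnKey("Test Key") // would not compile: a UDF cannot be applied to a plain String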

I usually pick one of two approaches to testing UDFs.

  1. Test the UDF in a Spark plan. In other words, create a test DataFrame and apply the UDF to it. Then collect the DataFrame and check its contents.
// In your test (this assumes a SparkSession named `spark` is in scope)
import org.apache.spark.sql.functions.col
import spark.implicits._

val testDF = Seq("Test Key", "", null).toDS().toDF("s")
val result = testDF.select(returnKey(col("s"))).as[String].collect.toSet
result should be(Set("Test Key", "", null))

Notice that this lets us test all our edge cases in a single Spark plan. In this case, I have included tests for the empty string and null.

  2. Extract the Scala function being wrapped by the UDF and test it as you would any other Scala function.
// The underlying Scala function, testable without Spark
def returnKeyImpl(key: String): String = {
    val abc: String = key
    abc
}

val returnKey = udf(returnKeyImpl _)

Now we can test returnKeyImpl by passing in strings and checking the string output.
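For example, a plain ScalaTest check along these lines (a minimal sketch; the expected values simply mirror the identity behaviour of returnKeyImpl):

test("returnKeyImpl returns its input unchanged") {
    returnKeyImpl("Test Key") should equal ("Test Key")
    returnKeyImpl("") should equal ("")
}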

Which is better?

There is a trade-off between these two approaches, and my recommendation is different depending on the situation.

If you are doing a larger test on bigger datasets, I would recommend testing the UDF in a Spark job.

Testing the UDF in a Spark job can raise issues that you wouldn't catch by only testing the underlying Scala function. For example, if your underlying Scala function relies on a non-serializable object, then Spark will be unable to serialize the UDF to send it to the workers and you will get an exception.
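As an illustration (a hypothetical sketch, not from the original answer; `Formatter` is an invented class), a UDF that closes over a non-serializable object compiles fine but fails once the plan actually runs:

// Hypothetical example: Formatter does not extend Serializable
class Formatter { def format(s: String): String = s.trim }

val formatter = new Formatter
val badUdf = udf((s: String) => formatter.format(s))

// badUdf(col("s")) builds a Column just fine, but executing the plan fails with
// org.apache.spark.SparkException: Task not serializable,
// because the closure captures the non-serializable `formatter`.
// Testing only the lambda directly would never surface this.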

On the other hand, starting Spark jobs in every unit test for every UDF can be quite slow. If you are only doing a small unit test, it will likely be faster to just test the underlying Scala function.
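One common mitigation (my own suggestion, not part of the original answer; the trait name is hypothetical) is to share a single local SparkSession across the whole test suite instead of building one per test:

import org.apache.spark.sql.SparkSession

// Hypothetical helper: one local SparkSession shared by every suite that mixes this in
trait SharedSparkSession {
    lazy val spark: SparkSession = SparkSession.builder()
        .master("local[2]")
        .appName("udf-unit-tests")
        .getOrCreate()
}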

Upvotes: 4
