samba

Reputation: 3111

Scala Unit test: how to validate the returned RDD

I have written a method to filter out duplicates from an RDD and decided to write a unit test for the method. Here is my method:

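  import org.apache.spark.rdd.RDD

  def filterDupes(salesWithDupes: RDD[((String, String), SalesData)]): RDD[((String, String), SalesData)] = {
    salesWithDupes
      // Re-key by (saleType, saleDate), keeping the original pair as the value.
      .map(record => ((record._2.saleType, record._2.saleDate), record))
      // Keep one record per key (which one is not guaranteed), then drop the temporary key.
      .reduceByKey((kept, _) => kept)
      .map(_._2)
  }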

Since this is my first time writing a test in Scala, I've run into several difficulties. Am I passing the elements from the list to the filtering method correctly?

Now I'm stuck on how to validate the result that is returned from the method. The only approach I came up with for now is collecting the RDD's data to a list and then checking its size. Is it the right way?

Here is how I see the logic of the test:

"Sales" should "be filtered" in {

    Given("Sales RDD")

    val rddWithDupes = sc.parallelize(Seq(
      (("metric1", "metric2"), createSale("1", saleType = "Type1", saleDate = "2014-10-12")),
      (("metric1", "metric2"), createSale("2", saleType = "Type1", saleDate = "2014-10-12")),
      (("metric1", "metric2"), createSale("3", saleType = "Type3", saleDate = "2010-11-01"))
    ))

    When("Sales RDD is filtered")

    val filteredResult = SalesProcessor.filterDupes(rddWithDupes).collect.toList

    Then("Sales are filtered")
    filteredResult.size should be(2)
    ????
  }

Upvotes: 0

Views: 360

Answers (1)

user10042351

The only approach I came up with for now is collecting the RDD's data to a list and then checking its size. Is it the right way?

Yes, it is. Distributed collections such as RDDs have no meaningful notion of equality, and short of tricks like:

  • checking that the sizes are the same,
  • checking that subtracting a from b leaves an empty result,
  • checking that subtracting b from a leaves an empty result,

you cannot really compare two RDDs.
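
As a rough sketch of those three checks, here they are applied to data that has already been collected to the driver. CollectedComparison and sameElements are hypothetical names, and Seq.diff is only a local stand-in: on the RDDs themselves the corresponding calls would be count() and subtract. Note that Seq.diff respects multiplicities, while RDD.subtract removes every occurrence of a matching element.

```scala
// Local stand-in for the three checks, applied to data already collected
// from the RDDs under comparison. All names here are hypothetical.
object CollectedComparison {

  /** True when both sequences hold the same elements with the same
    * multiplicities, regardless of order. */
  def sameElements[T](a: Seq[T], b: Seq[T]): Boolean =
    a.size == b.size &&      // same size
      a.diff(b).isEmpty &&   // nothing left in a after removing b's elements
      b.diff(a).isEmpty      // nothing left in b after removing a's elements
}
```

Either way, the comparison relies on the elements having a sensible equals, which a case class provides.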

There is another problem: the non-deterministic nature of shuffling operations (like reduceByKey). You have to assume that the result can differ from run to run and design your tests accordingly.

This makes testing quite challenging. In practice, I would recommend testing each function used in a transformation separately (avoid an untestable mess of anonymous functions) and, for the RDD as a whole, testing only the invariants that are guaranteed (size, set of keys, and so on).
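
A minimal sketch of that advice, assuming the key extraction and the merge rule are pulled out of filterDupes into named functions. The SalesData case class below is a hypothetical stand-in (the question does not show it, and the id field is an assumption), as are DedupChecks, dedupKey, keepFirst, and invariantsHold. The invariant check deliberately does not say which of two duplicates survives, since that is not guaranteed.

```scala
// Hypothetical stand-in for the question's SalesData. Only saleType and
// saleDate are taken from the source; the id field is an assumption.
final case class SalesData(id: String, saleType: String, saleDate: String)

object DedupChecks {

  /** Key used to detect duplicates: (saleType, saleDate). */
  def dedupKey(sale: SalesData): (String, String) =
    (sale.saleType, sale.saleDate)

  /** Merge rule that could be passed to reduceByKey: keep the first argument. */
  def keepFirst(a: SalesData, b: SalesData): SalesData = a

  /** Order-independent invariants of a deduplicated result. */
  def invariantsHold(input: Seq[SalesData], output: Seq[SalesData]): Boolean = {
    val inputKeys  = input.map(dedupKey).toSet
    val outputKeys = output.map(dedupKey)
    outputKeys.size == inputKeys.size &&  // exactly one record per distinct key
      outputKeys.toSet == inputKeys &&    // no key lost or invented
      output.forall(input.contains)       // every survivor came from the input
  }
}
```

In a Spark test, the two arguments could be the SalesData values collected from the input RDD and from the filtered RDD.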

Upvotes: 1
