I have written a method to filter out duplicates from an RDD and decided to write a unit test for the method. Here is my method:
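def filterDupes(salesWithDupes: RDD[((String, String), SalesData)]): RDD[((String, String), SalesData)] = {
  salesWithDupes
    // Key each record by (saleType, saleDate), the fields that define a duplicate.
    .map(sale => ((sale._2.saleType, sale._2.saleDate), sale))
    // Keep a single record per key.
    .reduceByKey((a, _) => a)
    // Drop the temporary key, restoring the original pair shape.
    .map(_._2)
}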
Since this is my first time writing a test in Scala, I've run into several difficulties. Am I correctly passing elements from the list to the filtering method?
Now I'm stuck on how to validate the result returned by the method. The only approach I came up with for now is collecting the RDD's data to a list and then checking its size. Is it the right way?
Here is how I see the logic of the test:
"Sales" should "be filtered" in {
  Given("Sales RDD")
  val rddWithDupes = sc.parallelize(Seq(
    (("metric1", "metric2"), createSale("1", saleType = "Type1", saleDate = "2014-10-12")),
    (("metric1", "metric2"), createSale("2", saleType = "Type1", saleDate = "2014-10-12")),
    (("metric1", "metric2"), createSale("3", saleType = "Type3", saleDate = "2010-11-01"))
  ))
  When("Sales RDD is filtered")
  val filteredResult = SalesProcessor.filterDupes(rddWithDupes).collect.toList
  Then("Sales are filtered")
  filteredResult.size should be(2)
  ????
}
Upvotes: 0
Views: 360
The only approach I came up with for now is collecting the RDD's data to a list and then checking its size. Is it the right way?
Yes, it is. Distributed objects have no meaningful notion of equality, and short of tricks like:
you cannot really compare two RDDs.
There is another problem: the non-deterministic nature of shuffling operations (like reduceByKey). You have to assume that the result can differ from run to run and design your tests accordingly.
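To illustrate why the surviving record cannot be asserted on, here is a minimal sketch that uses plain Scala collections as a stand-in for the RDD (no Spark involved). `Sale` is a hypothetical simplification of `SalesData`, `keepFirstPerKey` is an illustrative local imitation of the pipeline, and the two input orderings only mimic records arriving in a different order after a shuffle.

```scala
// Hypothetical, simplified stand-in for SalesData (an assumption, not from the post).
final case class Sale(id: String, saleType: String, saleDate: String)

object ShuffleOrderSketch {
  // Local imitation of map + reduceByKey((a, _) => a) + map(_._2):
  // group by the dedupe key, then keep whichever record arrived first.
  def keepFirstPerKey(sales: Seq[Sale]): Seq[Sale] =
    sales
      .groupBy(s => (s.saleType, s.saleDate))
      .values
      .map(_.reduce((first, _) => first))
      .toSeq

  val a = Sale("1", "Type1", "2014-10-12")
  val b = Sale("2", "Type1", "2014-10-12")
  val c = Sale("3", "Type3", "2010-11-01")

  // Same data, two different arrival orders.
  val run1 = keepFirstPerKey(Seq(a, b, c))
  val run2 = keepFirstPerKey(Seq(b, a, c))

  // Which id survives depends on the arrival order...
  val ids1 = run1.map(_.id).toSet // Set("1", "3")
  val ids2 = run2.map(_.id).toSet // Set("2", "3")

  // ...but the set of dedupe keys is the same in both runs.
  val keys1 = run1.map(s => (s.saleType, s.saleDate)).toSet
  val keys2 = run2.map(s => (s.saleType, s.saleDate)).toSet
}
```

A test that asserted "sale 1 is kept" would pass for `run1` and fail for `run2`, while an assertion on the size or on the set of keys holds for both.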
This makes testing quite challenging. In practice, I would recommend testing each function used in the transformation on its own (avoid an untestable mess of anonymous functions) and, for the pipeline as a whole, testing only invariants that are guaranteed (size, set of keys, and so on).
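As a rough sketch of that advice, the anonymous lambdas could be pulled out into named functions and checked without a SparkContext, with the collected result checked only for guaranteed invariants. Everything named here (`SaleRecord`, `dedupeKey`, `keepFirst`, `DedupeFunctionsCheck`) is hypothetical and introduced for illustration; `SaleRecord` is a simplified stand-in for `SalesData`, and the `collected` list only pretends to be the output of `collect.toList`.

```scala
// Hypothetical, simplified stand-in for SalesData (an assumption, not from the post).
final case class SaleRecord(id: String, saleType: String, saleDate: String)

object DedupeFunctions {
  // Named replacement for the anonymous key-extraction lambda.
  def dedupeKey(sale: SaleRecord): (String, String) =
    (sale.saleType, sale.saleDate)

  // Named replacement for the anonymous function passed to reduceByKey.
  def keepFirst[A](first: A, second: A): A = first
}

object DedupeFunctionsCheck {
  import DedupeFunctions._

  val s1 = SaleRecord("1", "Type1", "2014-10-12")
  val s2 = SaleRecord("2", "Type1", "2014-10-12")
  val s3 = SaleRecord("3", "Type3", "2010-11-01")

  def main(args: Array[String]): Unit = {
    // Function-level checks: plain values, no Spark needed.
    assert(dedupeKey(s1) == dedupeKey(s2))
    assert(dedupeKey(s1) != dedupeKey(s3))
    assert(keepFirst(s1, s2) == s1)

    // Invariant-level checks on a collected result.
    // Either s1 or s2 may survive, so only size and keys are asserted.
    val collected: List[SaleRecord] = List(s2, s3)
    val expectedKeys = Set(("Type1", "2014-10-12"), ("Type3", "2010-11-01"))
    assert(collected.size == 2)
    assert(collected.map(dedupeKey).toSet == expectedKeys)
  }
}
```

In the question's ScalaTest spec, the same invariants would presumably replace the `????` line, for example by comparing the set of `(saleType, saleDate)` pairs in `filteredResult` against the expected set.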
Upvotes: 1