ganesh hegde
ganesh hegde

Reputation: 11

How to pass custom function to reduceByKey of RDD in scala

My requirement is to find the maximum of each group in RDD.

I tried the below;

scala> val x = sc.parallelize(Array(Array("A",3), Array("B",5), Array("A",6)))
x: org.apache.spark.rdd.RDD[Array[Any]] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> x.collect
res0: Array[Array[Any]] = Array(Array(A, 3), Array(B, 5), Array(A, 6))          

scala> x.filter(math.max(_,_))
<console>:30: error: wrong number of parameters; expected = 1
              x.filter(math.max(_,_))
                               ^

I also tried the below; Option 1:

scala> x.filter((x: Int, y: Int) => { math.max(x,y)} )
<console>:30: error: type mismatch;
 found   : (Int, Int) => Int
 required: Array[Any] => Boolean
              x.filter((x: Int, y: Int) => { math.max(x,y)} )

Option 2:

scala> val myMaxFunc = (x: Int, y: Int) => { math.max(x,y)}
myMaxFunc: (Int, Int) => Int = <function2>

scala> myMaxFunc(56,12)
res10: Int = 56

scala> x.filter(myMaxFunc(_,_) )
<console>:32: error: wrong number of parameters; expected = 1
              x.filter(myMaxFunc(_,_) )

How to get this right ?

Upvotes: 0

Views: 1447

Answers (1)

stholzm
stholzm

Reputation: 3455

I can only guess, but probably you want to do:

val rdd = sc.parallelize(Array(("A", 3), ("B", 5), ("A", 6)))
val max = rdd.reduceByKey(math.max)
println(max.collect().toList)  // List((B,5), (A,6))

Instead of "How to get this right ?" you should have explained what your expected result is. I think you made a few mistakes:

  • using filter instead of reduceByKey (why??)
  • reduceByKey only works on PairRDDs, so you need tuples instead of Array[Any] (which is a bad type anyways)
  • you do not need to write your own wrapper function for math.max, you can just use it as-is

Upvotes: 1

Related Questions