weak_at_math

Reputation: 105

ReduceByKey when RDD value is a tuple

I am new to Apache Spark and am not able to get this to work.

I have an RDD of the form (Int, (Int, Int)) and, for each key, I would like to sum the first elements of the values while collecting the second elements into a list.

For example, I have the following RDD:

[(5,(1,0)), (5,(1,2)), (5,(1,5))]

And I want to be able to get something like this:

(5,3,(0,2,5))
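In plain Scala collections (without Spark), a sketch of what I am trying to do:

val data = Seq((5, (1, 0)), (5, (1, 2)), (5, (1, 5)))

data.groupBy(_._1).map { case (key, pairs) =>
  (key, pairs.map(_._2._1).sum, pairs.map(_._2._2))
}
// yields (5, 3, List(0, 2, 5))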

I tried this:

sampleRdd.reduceByKey{case(a,(b,c)) => (a + b)}

But I get this error:

type mismatch;
[error]  found   : Int
[error]  required: String
[error]     .reduceByKey{case(a,(b,c)) => (a + b)}
[error]                                        ^

How can I achieve this?

Upvotes: 0

Views: 159

Answers (1)

Learn Hadoop

Reputation: 3050

reduceByKey will not work here because its function must have type (V, V) => V: it can only combine two values of the same type, so it cannot turn your (Int, Int) values into a (sum, list) accumulator. aggregateByKey lets the accumulator type differ from the value type. Please try this:

// Sequence op: merge one (Int, Int) value into the running accumulator
// by summing the first element and appending the second element.
def seqOp = (accumulator: (Int, List[String]), element: (Int, Int)) =>
  (accumulator._1 + element._1, accumulator._2 :+ element._2.toString)

// Combine op: merge two accumulators built on different partitions.
def combOp = (accumulator1: (Int, List[String]), accumulator2: (Int, List[String])) =>
  (accumulator1._1 + accumulator2._1, accumulator1._2 ::: accumulator2._2)

// Zero value: the starting accumulator for each key.
val zeroVal = (0, List.empty[String])

rdd.aggregateByKey(zeroVal)(seqOp, combOp).collect
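
To sanity-check against the sample data from the question, a minimal sketch, assuming sc is an available SparkContext (not shown above):

val rdd = sc.parallelize(Seq((5, (1, 0)), (5, (1, 2)), (5, (1, 5))))

rdd.aggregateByKey(zeroVal)(seqOp, combOp).collect
// Array((5,(3,List(0, 2, 5)))), with the list elements as Strings

Note that the result comes back keyed as (5, (3, List(...))) rather than the flat (5, 3, (...)) shape shown in the question, and the second elements are collected as Strings. If you want them kept as Int, drop the .toString and declare the accumulator and zero value with List[Int] instead.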

Upvotes: 2
