Hoori M.

Reputation: 730

Doing reduceByKey on each partition of an RDD separately, without aggregating results

I have an RDD partitioned across the cluster and I want to apply reduceByKey to each partition separately, without the per-partition results being merged together. In other words, I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster.

The code below does not work, but I want something like this:

myPairedRDD.mapPartitions({iter => iter.reduceByKey((x, y) => x + y)})

How can I achieve this?

Upvotes: 1

Views: 587

Answers (1)

Alper t. Turker

Reputation: 35229

You could try something like this:

myPairedRDD.mapPartitions(iter =>
  // Materialize the partition, group by key, then sum the values per key.
  // Iterator has no groupBy, so the partition is first collected to a Seq.
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).iterator
)

or, to keep things more memory efficient (here I assume that myPairedRDD is an RDD[(String, Double)]; please adjust the types to match your use case):

import scala.collection.mutable

myPairedRDD.mapPartitions(iter =>
  // Accumulate per-key sums for this partition into a single mutable map.
  iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
    case (acc, (k, v)) => acc(k) += v; acc
  }.iterator
)

Please note, however, that unlike shuffle-based operations, this cannot spill data to disk: the per-partition map has to fit in memory.
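
For reference, here is a minimal, self-contained sketch of the second approach end to end. The sample data, the two-partition layout, and the local[2] master are assumptions for illustration only. Since "a" occurs in both partitions, it appears twice in the output, which confirms that nothing is merged across partition boundaries:

import org.apache.spark.sql.SparkSession
import scala.collection.mutable

object PerPartitionReduce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("per-partition-reduce")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two partitions: ("a", 1.0) and ("b", 2.0) land in the first,
    // ("a", 3.0) and ("c", 4.0) in the second.
    val myPairedRDD = sc.parallelize(
      Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0)), 2)

    val reduced = myPairedRDD.mapPartitions(iter =>
      iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
        case (acc, (k, v)) => acc(k) += v; acc
      }.iterator
    )

    // glom() keeps results grouped by partition, so the per-partition
    // sums stay visible: "a" shows up once in each partition's array.
    reduced.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }

    spark.stop()
  }
}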

Upvotes: 2
