Reputation: 730
I have an RDD partitioned across the cluster and I want to run reduceByKey
on each partition separately. I don't want the per-partition results of reduceByKey to be merged together; in other words, I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster.
The code below does not work, but I want something like this:
myPairedRDD.mapPartitions({iter => iter.reduceByKey((x, y) => x + y)})
How can I achieve this?
Upvotes: 1
Views: 587
Reputation: 35229
You could try something like this:
myPairedRDD.mapPartitions(iter =>
  // materialize the partition, group locally by key, and sum each group's values
  // (iter.toSeq is needed because Iterator itself has no groupBy)
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).iterator
)
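For illustration, here is a minimal, self-contained sketch of this variant (the local SparkContext setup and the sample data are my own assumptions, not from your question):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("per-partition-reduce").setMaster("local[2]"))

// hypothetical data: key "a" ends up in both partitions
val myPairedRDD = sc.parallelize(
  Seq(("a", 1.0), ("a", 2.0), ("b", 3.0), ("a", 4.0)), 2)

val perPartition = myPairedRDD.mapPartitions(iter =>
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).iterator
)

// prints something like (a,3.0), (b,3.0), (a,4.0):
// "a" appears once per partition because nothing is merged across partitions
perPartition.collect().foreach(println)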
Alternatively, to keep things more memory-efficient (here I assume that myPairedRDD is an RDD[(String, Double)]; please adjust the types to match your use case), you can fold each partition into a mutable map:
import scala.collection.mutable

myPairedRDD.mapPartitions(iter =>
  // fold the partition into a local map, summing values per key;
  // withDefaultValue(0.0) lets us add to keys we haven't seen yet
  iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
    case (acc, (k, v)) => { acc(k) += v; acc }
  }.iterator
)
but please note that, unlike shuffling operations, this cannot spill data to disk, so each per-partition map has to fit in memory.
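If you want to double-check that no shuffle is introduced, inspecting the lineage is one option (perPartition is the hypothetical RDD from the sketch above):

// only ParallelCollectionRDD / MapPartitionsRDD entries should appear here,
// whereas myPairedRDD.reduceByKey(_ + _).toDebugString would list a ShuffledRDD
println(perPartition.toDebugString)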
Upvotes: 2