Hoori M.

Reputation: 730

Doing reduceByKey on each partition of an RDD separately, without aggregating results

I have an RDD partitioned across the cluster and I want to apply reduceByKey to each partition separately, without the per-partition results being merged together. In other words, I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster.

The code below does not work, but I want something like this:

myPairedRDD.mapPartitions({iter => iter.reduceByKey((x, y) => x + y)})

How can I achieve this?

Upvotes: 1

Views: 587

Answers (1)

Alper t. Turker

Reputation: 35229

You could try something like this:

myPairedRDD.mapPartitions(iter =>
  // Materialize the partition, group by key, then sum the values per key.
  // Iterator has no groupBy, so the partition is first collected to a Seq.
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).iterator
)

or, to keep things more memory efficient (here I assume that myPairedRDD is an RDD[(String, Double)]; please adjust the types to match your use case):

import scala.collection.mutable

myPairedRDD.mapPartitions(iter =>
  // Accumulate per-key sums for this partition into a single mutable map.
  iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
    case (acc, (k, v)) => acc(k) += v; acc
  }.iterator
)

Please note, however, that unlike shuffle-based operations, this cannot spill data to disk: the per-partition map has to fit in memory.
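
For reference, here is a minimal, self-contained sketch of the second approach end to end. The sample data, the two-partition layout, and the local[2] master are assumptions for illustration only. Since "a" occurs in both partitions, it appears twice in the output, which confirms that nothing is merged across partition boundaries:

import org.apache.spark.sql.SparkSession
import scala.collection.mutable

object PerPartitionReduce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("per-partition-reduce")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two partitions: ("a", 1.0) and ("b", 2.0) land in the first,
    // ("a", 3.0) and ("c", 4.0) in the second.
    val myPairedRDD = sc.parallelize(
      Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0)), 2)

    val reduced = myPairedRDD.mapPartitions(iter =>
      iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
        case (acc, (k, v)) => acc(k) += v; acc
      }.iterator
    )

    // glom() keeps results grouped by partition, so the per-partition
    // sums stay visible: "a" shows up once in each partition's array.
    reduced.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }

    spark.stop()
  }
}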

Upvotes: 2
