Liu Chong

Reputation: 343

How to Get the Largest Key of each Spark Partition?

If we use .reduce(max), we get the largest key in the whole RDD. I know this reduce operates within each partition first and then combines the items sent back by each partition. But how can we get back the largest key of every partition? Should I write a function for .mapPartitions()?
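For example (a minimal PySpark sketch with hypothetical data; `sc` is the SparkContext):

# Pair RDD spread over 2 partitions
pairs = sc.parallelize([(3, "a"), (7, "b"), (1, "c"), (9, "d")], 2)

# max compares (key, value) tuples, so .reduce(max) returns the single
# element with the largest key in the whole RDD, not one per partition
print(pairs.reduce(max))  # (9, 'd')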

Upvotes: 0

Views: 958

Answers (1)

user6022341

Reputation:

You can:

rdd.mapPartitions(iter => Iterator(iter.reduce(Math.max)))

or

rdd.mapPartitions(lambda iter: [max(iter)])
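A fuller sketch of the Python version (hypothetical data; note that the built-in max raises ValueError on an empty partition, so this variant skips those):

from pyspark import SparkContext

sc = SparkContext("local[2]", "partition-max")  # hypothetical local setup
pairs = sc.parallelize([(3, "a"), (7, "b"), (1, "c"), (9, "d")], 2)

def partition_max(it):
    # Track the running maximum without materializing the iterator
    best = None
    for kv in it:
        if best is None or kv > best:
            best = kv
    # Yield nothing for an empty partition instead of raising
    if best is not None:
        yield best

print(pairs.mapPartitions(partition_max).collect())
# [(7, 'b'), (9, 'd')] -- one largest element per partition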

In streaming, use this with DStream.transform.
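In PySpark Streaming that could look like the following sketch (assuming `stream` is an existing DStream of key-value pairs and `partition_max` is the helper above):

# transform applies an RDD-to-RDD function to every micro-batch
per_partition_max = stream.transform(lambda rdd: rdd.mapPartitions(partition_max))
per_partition_max.pprint()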

Upvotes: 2
