Spark RDD operation like top returning a smaller RDD

Question

I am looking for a Spark RDD operation like top or takeOrdered, but one that returns another RDD rather than an Array, that is, one that does not collect the full result into RAM.

It can be a sequence of operations, but ideally no step should try to collect the full result into the memory of a single node.

Daniel Darabos · Accepted Answer

Let's say you want to have the top 50% of an RDD.

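import org.apache.spark.rdd.RDD

def top50(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  // Sort descending so the largest keys land in the lowest-numbered partitions.
  val sorted = rdd.sortByKey(ascending = false)
  val partitions = sorted.partitions.length
  // Keep the first half of the partitions (rounded up) and throw away the rest.
  sorted.mapPartitionsWithIndex { (pid, it) =>
    if (pid < (partitions + 1) / 2) it else Iterator.empty
  }
}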

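This is an approximation: you may get more or less than 50%. sortByKey range-partitions the data based on a sample, so the partitions are only roughly equal in size, and the cut always falls on a partition boundary. You could do better, but it would cost an extra evaluation of the RDD. For the use cases I have in mind, this would not be worth it.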
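If the exact count does matter, here is a rough sketch of what the extra evaluation could look like. This is not part of the approach above: the names keepPerPartition and exactTopHalf are made up for illustration, and caching the sorted RDD is my own assumption to avoid sorting twice. The idea is to count the elements in each sorted partition (only one number per partition reaches the driver) and then take just enough leading elements from each partition to reach half of the total.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: given per-partition element counts (in partition order)
// and the total number of elements to keep, return how many leading elements
// to keep from each partition.
def keepPerPartition(counts: Seq[Long], target: Long): Seq[Long] = {
  // offsets(i) is the number of elements that come before partition i.
  val offsets = counts.scanLeft(0L)(_ + _)
  counts.zip(offsets).map { case (count, offset) =>
    math.max(0L, math.min(count, target - offset))
  }
}

// Hypothetical exact variant of top50.
def exactTopHalf(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  // Cache so the counting pass and the final pass share one sort (assumption).
  val sorted = rdd.sortByKey(ascending = false).cache()
  // Extra pass: count each partition; only (index, count) pairs are collected.
  val counts = sorted
    .mapPartitionsWithIndex { (pid, it) => Iterator((pid, it.size.toLong)) }
    .collect()
    .sortBy(_._1)
    .map(_._2)
    .toSeq
  val keep = keepPerPartition(counts, counts.sum / 2)
  // Partitions past the cut-off get take(0), which yields an empty iterator.
  sorted.mapPartitionsWithIndex { (pid, it) => it.take(keep(pid).toInt) }
}
```

For example, with partition sizes 3, 3 and 4 and a target of 5 elements, keepPerPartition keeps 3 from the first partition, 2 from the second and none from the third. The result is still an RDD, and nothing larger than the list of partition counts is brought to a single node.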
